for loop with dplyr summarise returning different results than group_by
I'm getting weird results when applying a for loop with dplyr's summarise function, and I don't know why or how to fix it.
library(dplyr)

test <- data.frame(title = c("a", "b", "c","a","b","c", "a", "b", "c","a","b","c"),
                   category = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                   sex = c("m", "m", "m", "f", "f", "f", "m", "m", "m", "f", "f", "f"),
                   salary = c(50,70,90,40,60,85, 220,270,350,180,200,330))
category_list <- unique(test$category)
tmp = list()
for (category in category_list) {
  # Create an average salary line for the category
  tmp[category] <- test %>%
    filter(category == category) %>%
    summarise(mean(salary))
  print(tmp)
}
I get this as output:
$A
[1] 162.0833
$A
[1] 162.0833
$B
[1] 162.0833
Whereas the group_by() function returns the appropriate results:
test %>% group_by(category) %>% summarise(mean(salary))
# A tibble: 2 x 2
category `mean(salary)`
<fct> <dbl>
1 A 65.8
2 B 258.
Substituting a specific category also returns the appropriate result:
test %>%
filter(category == "A") %>%
summarise(mean(salary))
mean(salary)
1 65.83333
So could something be wrong with the category_list object? Surprisingly, when I use the first element of the category_list object, I also get the appropriate answer:
test %>%
  filter(category == category_list[1]) %>%
  summarise(mean(salary))
mean(salary)
1 65.83333
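The reason I want to figure this out (without using group_by) is that I'm trying to write a script that creates multiple ggplot objects and then combines them with the gridExtra library.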
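Maybe I'm wrong and group_by can be used, but the only way I could think of doing this is the following pseudocode:
- For each category, create a list of means to use in the geom_hline() argument
- For each category, subset the data frame object; each subset will be used in ggplot with its own geom_hline()
- For each category, create a list of plot objects
- Use grid.arrange() from the gridExtra library outside the for loop to combine the plots together
This is my code so far (not working):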
library(dplyr)
library(ggplot2)
library(gridExtra)
p = list()
avg_line = list()
tmp = list()
category_data = data.frame()
for (category in category_list) {
  # Create an average salary line for the category
  tmp[[category]] <- test %>%
    filter(category == category) %>%
    summarise(mean(salary))
  avg_line[[category]] <- tmp[[2]]
  # Subset data frame on category
  category_data[[category]] <- test %>% filter(category == category)
  # Make plots for each category
  p[[category]] <-
    ggplot(category_data[[category]], aes(x = title, y = salary)) +
    geom_line(color = "white") +
    geom_point(aes(color = sex)) +
    scale_color_manual(values = c("#F49171", "#81C19C")) +
    geom_hline(yintercept = avg_line[[category]], color = "white", alpha = 0.6, size = 1) +
    theme(legend.position = "none",
          panel.background = element_rect(color = "#242B47", fill = "#242B47"),
          plot.background = element_rect(color = "#242B47", fill = "#242B47"),
          axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
          axis.text = element_text(family = "Georgia", color = "white"),
          axis.text.x = element_text(angle = 90),
          # Get rid of the y- and x-axis titles
          axis.title.y = element_blank(),
          axis.title.x = element_blank(),
          panel.grid.major.y = element_line(color = "grey48", size = 0.05),
          panel.grid.minor.y = element_blank(),
          panel.grid.major.x = element_blank())
}
grid.arrange(grobs = p, nrow = 1)
My desired output is something like this:
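The problem in your for loop is the statement filter(category == category). It is always TRUE, because inside filter() both occurrences of category are taken from the data (the category column), not from the loop variable. Every row therefore passes the filter, and you get the overall mean of 162.0833 each time. If you really need a for loop, just rename the iterator so that it no longer shares its name with a column.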
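As a minimal sketch of that fix, the example below first shows that the original filter keeps all 12 rows, then renames the iterator. The names `cat_name`, `means`, `n_kept` and `mean_b` are illustrative and not part of the original code. The second variant, which keeps the name `category` and disambiguates with the `.data` and `.env` pronouns, assumes a reasonably recent dplyr version.

```r
library(dplyr)

test <- data.frame(
  title    = rep(c("a", "b", "c"), times = 4),
  category = rep(c("A", "B"), each = 6),
  sex      = rep(rep(c("m", "f"), each = 3), times = 2),
  salary   = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330),
  stringsAsFactors = FALSE
)

# Inside filter(), `category` resolves to the column on both sides of `==`,
# so the condition is TRUE for every row and nothing is filtered out.
category <- "A"
n_kept <- nrow(filter(test, category == category))

# Fix 1: use a loop variable whose name is not a column name.
means <- list()
for (cat_name in unique(test$category)) {
  means[[cat_name]] <- test %>%
    filter(category == cat_name) %>%
    summarise(mean_salary = mean(salary)) %>%
    pull(mean_salary)
}

# Fix 2 (assumes a recent dplyr): keep the name, but say explicitly which
# `category` is the column (.data) and which is the outside variable (.env).
category <- "B"
mean_b <- test %>%
  filter(.data$category == .env$category) %>%
  summarise(mean_salary = mean(salary)) %>%
  pull(mean_salary)
```

With either variant the per-category means should agree with the group_by() result shown in the question.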
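However, you don't need grid.arrange at all. facet_wrap gives you exactly what you need (you may want to reformat the facet labels a little; they are controlled by the theme elements whose names start with strip):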
library(dplyr)
library(ggplot2)

# One row per category holding the mean salary, used for the horizontal lines
category_means <- test %>%
  group_by(category) %>%
  summarize_at(vars(salary), mean)

p <- test %>%
  ggplot(aes(x = title, y = salary, color = sex)) +
  # One panel per category, side by side, each with its own y-axis range
  facet_wrap(~ category, nrow = 1, scales = "free_y") +
  geom_line(color = 'white') +
  geom_point() +
  scale_color_manual(values = c("#F49171", "#81C19C")) +
  # Mean line per panel, taken from the matching row of category_means
  geom_hline(data = category_means, aes(yintercept = salary), color = 'white', alpha = 0.6, size = 1) +
  theme(legend.position = "none",
        panel.background = element_rect(color = "#242B47", fill = "#242B47"),
        plot.background = element_rect(color = "#242B47", fill = "#242B47"),
        axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
        axis.text = element_text(family = "Georgia", color = "white"),
        axis.text.x = element_text(angle = 90),
        # Get rid of the y- and x-axis titles
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey48", size = 0.05),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank())
p
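The facet label reformatting mentioned above could look roughly like the sketch below. It is only an illustration: the plot is stripped down to points, the object name `p_strip`, the label texts "Category A" / "Category B", and the colour and font choices are assumptions rather than part of the original answer.

```r
library(ggplot2)

test <- data.frame(
  title    = rep(c("a", "b", "c"), times = 4),
  category = rep(c("A", "B"), each = 6),
  sex      = rep(rep(c("m", "f"), each = 3), times = 2),
  salary   = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330)
)

p_strip <- ggplot(test, aes(x = title, y = salary, color = sex)) +
  # as_labeller() maps the raw facet values to display labels (hypothetical texts)
  facet_wrap(~ category, nrow = 1, scales = "free_y",
             labeller = as_labeller(c(A = "Category A", B = "Category B"))) +
  geom_point() +
  theme(
    # strip.* elements style the facet label boxes and their text
    strip.background = element_rect(fill = "#242B47", color = NA),
    strip.text = element_text(color = "white", face = "bold")
  )
```

The same two `strip.*` lines can be added to the theme() call of the full plot so that the label boxes match its dark background.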