R dplyr: Group and summarize while retaining other non-numeric columns

I want to calculate grouped means of multiple columns in a data frame. In the process, I want to retain non-numeric columns that don't vary within the grouping variable. Here's a simple example.

library(dplyr) 

# create data frame
df <- data.frame(team=c('A', 'A', 'B', 'B', 'B', 'C', 'C'),
                 state=c('Michigan', 'Michigan', 'Michigan', 'Michigan', 'Michigan', 'Alabama', 'Alabama'),
                 region=c('Midwest', 'Midwest', 'Midwest', 'Midwest', 'Midwest', 'South', 'South'),
                 pts=c(5, 8, 14, 18, 5, 7, 7),
                 rebs=c(8, 8, 9, 3, 8, 7, 4),
                 ast=c(8, 6, 7, 5, 3, 0, 9))

The resulting data frame:

> df
  team    state  region pts rebs ast
1    A Michigan Midwest   5    8   8
2    A Michigan Midwest   8    8   6
3    B Michigan Midwest  14    9   7
4    B Michigan Midwest  18    3   5
5    B Michigan Midwest   5    8   3
6    C  Alabama   South   7    7   0
7    C  Alabama   South   7    4   9

Summarizing by mean with 'team' as the grouping variable is straightforward enough:

> df %>%
+   group_by(team) %>%
+   summarise_at(vars(pts, rebs, ast), list(mean))
# A tibble: 3 × 4
  team    pts  rebs   ast
  <chr> <dbl> <dbl> <dbl>
1 A       6.5  8      7  
2 B      12.3  6.67   5  
3 C       7    5.5    4.5

But how do I retain the other ID columns, which don't change within a team? In other words, how do I get the following:

  team  state     region     pts  rebs   ast
  <chr> <chr>     <chr>     <dbl> <dbl> <dbl>
1 A     Michigan   Midwest    6.5  8      7  
2 B     Michigan   Midwest   12.3  6.67   5  
3 C     Alabama    South      7    5.5    4.5

Thanks!!

I would advise putting all the columns you need to retain inside the group_by() verb, for the following reasons:

If these columns vary within a group, you need to select one of the different values, which forces you to use some function to do so.

If they are constant within a group, the group_by() verb is enough.

df %>%
  group_by(team, state, region) %>%
  summarise_at(vars(pts, rebs, ast), list(mean))
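
As a hedged sketch of the two cases described above, the same result can be written with across(), which dplyr 1.0.0 and later offers as the successor to summarise_at(). The data frame repeats the question's example (with the state spelled out as in the printed data frame); the names by_ids and by_first are chosen here for illustration only. Option 2 assumes that taking the first value per team is acceptable, which holds only when the ID columns really are constant within a team.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  team   = c("A", "A", "B", "B", "B", "C", "C"),
  state  = c("Michigan", "Michigan", "Michigan", "Michigan", "Michigan",
             "Alabama", "Alabama"),
  region = c("Midwest", "Midwest", "Midwest", "Midwest", "Midwest",
             "South", "South"),
  pts    = c(5, 8, 14, 18, 5, 7, 7),
  rebs   = c(8, 8, 9, 3, 8, 7, 4),
  ast    = c(8, 6, 7, 5, 3, 0, 9)
)

# Option 1: the ID columns are constant within a team, so group by all of them
by_ids <- df %>%
  group_by(team, state, region) %>%
  summarise(across(c(pts, rebs, ast), mean), .groups = "drop")

# Option 2: group by team only and explicitly pick one value per ID column
# (here the first one; this is the "some function" needed if values could vary)
by_first <- df %>%
  group_by(team) %>%
  summarise(
    across(c(state, region), first),
    across(c(pts, rebs, ast), mean),
    .groups = "drop"
  )
```

Both results have one row per team with the columns team, state, region, pts, rebs and ast. Setting .groups = "drop" returns an ungrouped tibble, so later steps do not silently operate per group.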

Using a data.table approach:

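library(data.table)

setDT(df)  # convert df to a data.table by reference
vars <- c("pts", "rebs", "ast")
# overwrite each column with its team mean, then keep the first row per team
df[, (vars) := lapply(.SD, mean, na.rm = TRUE), .SDcols = vars, by = "team"][, .SD[1], by = "team"]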

Output:

   team    state  region      pts     rebs ast
1:    A Michigan Midwest  6.50000 8.000000 7.0
2:    B Michigan Midwest 12.33333 6.666667 5.0
3:    C  Alabama   South  7.00000 5.500000 4.5
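
The data.table code above overwrites pts, rebs and ast in df by reference. A possible alternative, sketched here under the assumption that state and region are constant within each team, is to put the ID columns in by, mirroring the group_by() advice; this leaves the input table unchanged. The names dt and res are illustrative only.

```r
library(data.table)

# Example data from the question, built directly as a data.table
dt <- data.table(
  team   = c("A", "A", "B", "B", "B", "C", "C"),
  state  = c("Michigan", "Michigan", "Michigan", "Michigan", "Michigan",
             "Alabama", "Alabama"),
  region = c("Midwest", "Midwest", "Midwest", "Midwest", "Midwest",
             "South", "South"),
  pts    = c(5, 8, 14, 18, 5, 7, 7),
  rebs   = c(8, 8, 9, 3, 8, 7, 4),
  ast    = c(8, 6, 7, 5, 3, 0, 9)
)

vars <- c("pts", "rebs", "ast")

# Group by every ID column and average only the numeric columns in .SDcols;
# dt itself is not modified because no := assignment is used
res <- dt[, lapply(.SD, mean, na.rm = TRUE),
          by = .(team, state, region), .SDcols = vars]
```

res holds one row per team with the grouping columns first, followed by the averaged columns, while dt still contains the original seven rows.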
