Summary statistics with grouping by multiple columns: dataframe vs. data.table vs. dplyr

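I am working through the Titanic case study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv. He uses data.table to summarize the data set as follows: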

# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex]    # .N= # obs in by group

The result of the last command is:

      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843

More interesting is

d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]

which gives

      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216

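The results are exactly what I want, but the syntax relies heavily on features of data.table. How can I get the same results using a data frame instead of a data.table, ideally with base R, but also with dplyr?

Sincerely,

Thomas Philips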

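With dplyr, the most straightforward way to do this is probably the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class can be done as follows: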

d %>%  
  group_by(Sex, Class) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups:   Sex [2]
  Sex    Class Proportion     N
  <chr>  <dbl>      <dbl> <int>
1 female     1      0.965   144
2 female     2      0.887   106
3 female     3      0.491   216
4 male       1      0.344   180
5 male       2      0.141   170
6 male       3      0.152   493
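
The message printed above says that the result is still grouped by Sex and that the `.groups` argument overrides this. A minimal sketch of that option follows, assuming dplyr 1.0.0 or later; the data frame `toy` is made up for illustration and is not titanic5.csv:

```r
library(dplyr)

# Hypothetical toy data (NOT the Titanic data), with one NA in Survived
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Class    = c(1, 1, 2, 1, 2, 2, 3, 3),
  Survived = c(1, 1, 0, 0, 1, 0, 0, NA)
)

out <- toy %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mean(Survived, na.rm = TRUE),  # same idea as mn()
    N          = sum(!is.na(Survived)),         # same idea as Nna()
    .groups    = "drop"                         # return an ungrouped tibble, no message
  )

out
```

With `.groups = "drop"` the result carries no grouping, which is usually what a final summary table needs.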

And summarizing by Sex only:

d %>%  
  group_by(Sex) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
# A tibble: 2 × 3
  Sex    Proportion     N
  <chr>       <dbl> <int>
1 female      0.727   466
2 male        0.191   843

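Finally, here is a solution using stats, which I think is close enough to "base R" (?). It is not the most elegant solution, i.e. it may be possible to assign the variable names and functions in a single step, but it works well: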

summary_mn <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) mn(x))
summary_nna <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) Nna(x))

summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")

summary_full

# Output
     Sex Class Proportion   N
1 female     1  0.9652778 144
2 female     2  0.8867925 106
3 female     3  0.4907407 216
4   male     1  0.3444444 180
5   male     2  0.1411765 170
6   male     3  0.1521298 493

  • base R
aggregate(Survived ~ Sex , d ,
          \(x) c(Proportion = mean(x) , N = length(x)))

aggregate(Survived ~ Sex + Class, d , 
          \(x) c(Proportion = mean(x) , N = length(x)))
  • Output
#> first

     Sex Survived.Proportion Survived.N
1 female           0.7274678        466
2   male           0.1909846        843

#> second

     Sex Class Survived.Proportion Survived.N
1 female     1           0.9652778        144
2   male     1           0.3444444        180
3 female     2           0.8867925        106
4   male     2           0.1411765        170
5 female     3           0.4907407        216
6   male     3           0.1521298        493
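
One caveat with this approach: when FUN returns a vector of length greater than one, aggregate() stores the results in a single matrix column (here named Survived), so the printed names Survived.Proportion and Survived.N are not separate columns. A minimal sketch of flattening it with do.call(data.frame, ...) follows; the data frame `toy` and the object names `res` and `flat` are made up for illustration and are not from titanic5.csv:

```r
# Hypothetical toy data (NOT the Titanic data), with one NA in Survived
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Class    = c(1, 1, 2, 1, 2, 2, 3, 3),
  Survived = c(1, 1, 0, 0, 1, 0, 0, NA)
)

# The formula method drops rows with NA by default (na.action = na.omit)
res <- aggregate(Survived ~ Sex + Class, data = toy,
                 FUN = function(x) c(Proportion = mean(x), N = length(x)))

ncol(res)   # 3: Sex, Class, and one matrix column called Survived

# Expand the matrix column into ordinary columns
flat <- do.call(data.frame, res)
names(flat)[3:4] <- c("Proportion", "N")

flat
```

After flattening, `flat$Proportion` and `flat$N` can be used like any other column, for example in a later merge() or when sorting.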
