Summary statistics with grouping by multiple columns: data frame vs. data.table vs. dplyr
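I am working through the Titanic case study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv. He summarizes the data set using data.table as follows: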
# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex] # .N= # obs in by group
The result of the last command is:
      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843
More interesting is
d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]
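which gives
      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216
The result is exactly what I want, but the syntax relies heavily on features of data.table. How can I get the same result with a data frame instead of a data.table, ideally using base R, but also using dplyr?
Sincerely,
Thomas Philips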
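With dplyr, the most straightforward way is probably the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class can be done as follows:
library(dplyr)

d %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mn(Survived),
    N = Nna(Survived)
  )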
# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups:   Sex [2]
  Sex    Class Proportion     N
  <chr>  <dbl>      <dbl> <int>
1 female     1      0.965   144
2 female     2      0.887   106
3 female     3      0.491   216
4 male       1      0.344   180
5 male       2      0.141   170
6 male       3      0.152   493
And summarizing by Sex alone:
d %>%
  group_by(Sex) %>%
  summarize(
    Proportion = mn(Survived),
    N = Nna(Survived)
  )
# Output
# A tibble: 2 × 3
  Sex    Proportion     N
  <chr>       <dbl> <int>
1 female      0.727   466
2 male        0.191   843
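The message about grouped output in the two-column summary points at the `.groups` argument. As a minimal sketch of that option, the example below uses a small made-up data frame (`toy`, not the Titanic data) and inlines the two helper functions, so it runs on its own. Setting `.groups = "drop"` returns an ungrouped tibble and suppresses the message:

```r
library(dplyr)

# Toy data invented for illustration; these are NOT the Titanic data
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Class    = c(1, 1, 1, 2, 2, 2),
  Survived = c(1, 0, 0, 1, NA, 1)
)

res <- toy %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mean(Survived, na.rm = TRUE),  # same as mn()
    N          = sum(!is.na(Survived)),         # same as Nna()
    .groups    = "drop"                         # ungrouped result, no message
  )

res
```

Whether to drop the grouping is a matter of taste; keeping it only matters if further grouped operations follow.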
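Finally, here is a solution using the stats package, which I think is close enough to "base R" (?). It is not the most elegant solution (it may be possible to assign the variable names and functions in one step), but it works well: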
summary_mn <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) mn(x))
summary_nna <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) Nna(x))
summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")
summary_full
# Output
     Sex Class Proportion   N
1 female     1  0.9652778 144
2 female     2  0.8867925 106
3 female     3  0.4907407 216
4   male     1  0.3444444 180
5   male     2  0.1411765 170
6   male     3  0.1521298 493
Base R (the \(x) lambda shorthand requires R 4.1 or later):
aggregate(Survived ~ Sex, d,
          \(x) c(Proportion = mean(x), N = length(x)))
aggregate(Survived ~ Sex + Class, d,
          \(x) c(Proportion = mean(x), N = length(x)))
#> first
     Sex Survived.Proportion Survived.N
1 female           0.7274678        466
2   male           0.1909846        843
#> second
     Sex Class Survived.Proportion Survived.N
1 female     1           0.9652778        144
2   male     1           0.3444444        180
3 female     2           0.8867925        106
4   male     2           0.1411765        170
5 female     3           0.4907407        216
6   male     3           0.1521298        493
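One thing to be aware of with this approach: when FUN returns a vector, aggregate() stores the result as a single matrix-valued column, so the data frame has fewer real columns than the printout suggests. The sketch below, on a small made-up data frame (`toy`, not the Titanic data), shows one common way to flatten it with do.call(data.frame, ...). The formula interface also drops rows with NA by default, which is why mean() and length() are enough here.

```r
# Toy data invented for illustration; these are NOT the Titanic data
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Class    = c(1, 1, 1, 2, 2, 2),
  Survived = c(1, 0, 0, 1, 1, 1)
)

agg <- aggregate(Survived ~ Sex + Class, data = toy,
                 FUN = function(x) c(Proportion = mean(x), N = length(x)))

ncol(agg)  # Sex, Class, and one matrix-valued Survived column

# Flatten the matrix column into ordinary columns, then rename them
flat <- do.call(data.frame, agg)
names(flat)[3:4] <- c("Proportion", "N")
flat
```

After flattening, `flat` can be used like the merged result from the other base R answer, for example `flat$N` or `flat[flat$Sex == "male", ]`.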