Summary statistics with grouping by multiple columns: data frame vs. data.table vs. dplyr
I am working through the Titanic case study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The original data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv. He uses data.table to summarize the dataset as follows:
# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex] # .N= # obs in by group
The result of the last command is:
sex Proportion N
1: female 0.7274678 466
2: male 0.1909846 843
More interesting is
d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]
which gives
sex class Proportion N
1: female 1 0.9652778 144
2: male 1 0.3444444 180
3: male 2 0.1411765 170
4: female 2 0.8867925 106
5: male 3 0.1521298 493
6: female 3 0.4907407 216
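The results are exactly what I want, but the syntax relies heavily on features of data.table. How can I get the same results using a data frame instead of a data.table, ideally in base R, but also with dplyr?
Sincerely,
Thomas Philips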
As for dplyr, the most straightforward way to do this is probably the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class can be done as follows:
library(dplyr)
d %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mn(Survived),
    N = Nna(Survived)
  )
# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups: Sex [2]
Sex Class Proportion N
<chr> <dbl> <dbl> <int>
1 female 1 0.965 144
2 female 2 0.887 106
3 female 3 0.491 216
4 male 1 0.344 180
5 male 2 0.141 170
6 male 3 0.152 493
And summarizing by Sex only:
d %>%
  group_by(Sex) %>%
  summarize(
    Proportion = mn(Survived),
    N = Nna(Survived)
  )
# Output
# A tibble: 2 × 3
Sex Proportion N
<chr> <dbl> <int>
1 female 0.727 466
2 male 0.191 843
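The message about grouped output in the first result above can be avoided by setting the `.groups` argument of summarize(). The following is a minimal sketch on a small made-up data frame (`toy` and its values are assumptions for illustration, not the real Titanic data); `mn` and `Nna` are the helper functions from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for the Titanic data frame `d`
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Class    = c(1, 1, 2, 1, 2, 2, 2, 2),
  Survived = c(1, 1, NA, 0, 0, 1, 0, 1)
)

# Helper functions from the question
mn  <- function(x) mean(x, na.rm = TRUE)  # mean ignoring NAs
Nna <- function(x) sum(!is.na(x))         # number of non-NA values

# .groups = "drop" returns an ungrouped tibble and suppresses the message
res <- toy %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mn(Survived),
    N          = Nna(Survived),
    .groups    = "drop"
  )
res
```

With `.groups = "drop"` the result carries no leftover grouping, so later operations on `res` apply to the whole table rather than per Sex.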
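Finally, here is also a solution using stats, which I think is close enough to "base R" (?). It is not the most elegant solution (it might be possible to assign the variable names and functions in one step), but it works well: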
summary_mn <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) mn(x))
summary_nna <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) Nna(x))
summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")
summary_full
# Output
Sex Class Proportion N
1 female 1 0.9652778 144
2 female 2 0.8867925 106
3 female 3 0.4907407 216
4 male 1 0.3444444 180
5 male 2 0.1411765 170
6 male 3 0.1521298 493
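One possible way to assign the column names and functions in a single pass, without the merge step, is to split the data frame by group and build one summary row per group. This is only a sketch on made-up data (`toy`, `groups`, `rows` and `summary_split` are hypothetical names); it reuses `mn` and `Nna` from the question.

```r
# Hypothetical toy data standing in for the Titanic data frame `d`
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Class    = c(1, 1, 2, 1, 2, 2, 2, 2),
  Survived = c(1, 1, NA, 0, 0, 1, 0, 1)
)

# Helper functions from the question
mn  <- function(x) mean(x, na.rm = TRUE)  # mean ignoring NAs
Nna <- function(x) sum(!is.na(x))         # number of non-NA values

# One sub data frame per observed Sex/Class combination
groups <- split(toy, list(toy$Sex, toy$Class), drop = TRUE)

# One summary row per group, with the final column names set directly
rows <- lapply(groups, function(g) {
  data.frame(
    Sex        = g$Sex[1],
    Class      = g$Class[1],
    Proportion = mn(g$Survived),
    N          = Nna(g$Survived)
  )
})

# Stack the rows into a single data frame
summary_split <- do.call(rbind, rows)
rownames(summary_split) <- NULL
summary_split
```

Unlike the formula interface of aggregate(), which drops rows with missing values before calling the function, this version passes the NAs through to `mn` and `Nna`, so the helpers do the NA handling themselves.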
base R
aggregate(Survived ~ Sex, d,
          \(x) c(Proportion = mean(x), N = length(x)))
aggregate(Survived ~ Sex + Class, d,
          \(x) c(Proportion = mean(x), N = length(x)))
#> first
Sex Survived.Proportion Survived.N
1 female 0.7274678 466
2 male 0.1909846 843
#> second
Sex Class Survived.Proportion Survived.N
1 female 1 0.9652778 144
2 male 1 0.3444444 180
3 female 2 0.8867925 106
4 male 2 0.1411765 170
5 female 3 0.4907407 216
6 male 3 0.1521298 493
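Note that when the function passed to aggregate() returns a named vector, the result stores both statistics in a single matrix column called Survived, which is why the printed headers read Survived.Proportion and Survived.N. If plain Proportion and N columns are wanted, the matrix can be bound back onto the grouping columns. A minimal sketch on made-up data (`toy`, `agg` and `flat` are hypothetical names):

```r
# Hypothetical toy data standing in for the Titanic data frame `d`
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Class    = c(1, 1, 2, 1, 2, 2, 2, 2),
  Survived = c(1, 1, NA, 0, 0, 1, 0, 1)
)

# The formula interface drops rows with NA (na.action = na.omit by default),
# so mean() and length() only see the non-missing values
agg <- aggregate(Survived ~ Sex + Class, toy,
                 function(x) c(Proportion = mean(x), N = length(x)))

# agg has three columns; the third one is a two-column matrix
is.matrix(agg$Survived)

# Bind the matrix columns next to the grouping columns
flat <- cbind(agg[c("Sex", "Class")], agg$Survived)
flat
```

After this step `flat` is an ordinary data frame with the columns Sex, Class, Proportion and N.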