Summary statistics grouped by multiple columns: dataframe vs. data.table vs. dplyr

Question

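I am working through the Titanic case study in Frank Harrell's R Flow course ( http://hbiostat.org/rflow/case.html ) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv. He uses data.table to summarize the dataset as follows: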

# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex]    # .N= # obs in by group

The result of the last command is:

      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843

More interesting is

d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]

which gives

      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216

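The result is exactly what I want, but the syntax relies heavily on data.table features. How can I get the same result with a data frame instead of a data table, ideally using base R, but also using dplyr?

Sincerely,

Thomas Philips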

Answer 1

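With dplyr, the most straightforward way to do this is probably the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class can be done as follows: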

d %>%  
  group_by(Sex, Class) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups:   Sex [2]
  Sex    Class Proportion     N
  <chr>  <dbl>      <dbl> <int>
1 female     1      0.965   144
2 female     2      0.887   106
3 female     3      0.491   216
4 male       1      0.344   180
5 male       2      0.141   170
6 male       3      0.152   493

And summarizing by Sex only:

d %>%  
  group_by(Sex) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
# A tibble: 2 × 3
  Sex    Proportion     N
  <chr>       <dbl> <int>
1 female      0.727   466
2 male        0.191   843
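
The `summarise()` message in the first output above appears because the result is still grouped by Sex. A minimal sketch of how `.groups = "drop"` returns an ungrouped tibble instead, assuming dplyr 1.0 or later; `toy` is a small hypothetical data frame standing in for the Titanic data, not the real file:

```r
library(dplyr)

# Hypothetical toy data (made-up values), used only to keep the example self-contained
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male"),
  Class    = c(1, 1, 2, 1, 2, 2),
  Survived = c(1, NA, 1, 0, 0, 1)
)

res <- toy %>%
  group_by(Sex, Class) %>%
  summarise(
    Proportion = mean(Survived, na.rm = TRUE),  # same idea as mn()
    N          = sum(!is.na(Survived)),         # same idea as Nna()
    .groups    = "drop"                         # ungroup the result, no message
  )

res
```

Dropping the grouping is usually the safer default when the summary is the final result, since later operations on a still-grouped tibble are applied per group.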

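Finally, here is a solution using stats, which I think is close enough to "base R" (?). It is not the most elegant solution, i.e., it may be possible to assign the variable names and functions in one step, but it works well: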

# mn() and Nna() are the helper functions defined in the question
summary_mn <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = mn)
summary_nna <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = Nna)

summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")

summary_full

# Output
     Sex Class Proportion   N
1 female     1  0.9652778 144
2 female     2  0.8867925 106
3 female     3  0.4907407 216
4   male     1  0.3444444 180
5   male     2  0.1411765 170
6   male     3  0.1521298 493

Answer 2

With base R (the `\(x)` lambda shorthand requires R 4.1 or later):

aggregate(Survived ~ Sex , d ,
          \(x) c(Proportion = mean(x) , N = length(x)))

aggregate(Survived ~ Sex + Class, d , 
          \(x) c(Proportion = mean(x) , N = length(x)))

Output

#> first

     Sex Survived.Proportion Survived.N
1 female           0.7274678        466
2   male           0.1909846        843

#> second

     Sex Class Survived.Proportion Survived.N
1 female     1           0.9652778        144
2   male     1           0.3444444        180
3 female     2           0.8867925        106
4   male     2           0.1411765        170
5 female     3           0.4907407        216
6   male     3           0.1521298        493
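
One thing to watch with this approach: when the function passed to aggregate() returns a named vector, the summary is stored as a single matrix column, so `Survived.Proportion` and `Survived.N` are not separate columns even though they print that way. A minimal sketch of flattening it with `do.call(data.frame, ...)`, using a small hypothetical `toy` data frame (made-up values) in place of the real data:

```r
# Hypothetical toy data (made-up values), used only to keep the example self-contained
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male"),
  Class    = c(1, 1, 2, 1, 2, 2),
  Survived = c(1, 0, 1, 0, 0, 1)
)

agg <- aggregate(Survived ~ Sex + Class, toy,
                 function(x) c(Proportion = mean(x), N = length(x)))

# The two statistics live in ONE matrix column called "Survived"
ncol(agg)                # 3 columns: Sex, Class, Survived
is.matrix(agg$Survived)  # TRUE

# Flatten the matrix column into ordinary data frame columns
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, `flat$Survived.Proportion` and `flat$Survived.N` can be used like any other column. Note also that the formula interface of aggregate() drops rows with missing values by default (`na.action = na.omit`), which is why `mean(x)` and `length(x)` work here without explicit NA handling.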

Answer 1
0 2022-08-13 17:00:31

Answer 2
0 2022-08-13 17:22:46
