Summary statistics grouped by multiple columns: dataframe vs. data.table vs. dplyr

Question

I'm working my way through the Titanic study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv. He uses data.table (d is a data.table) to summarize the dataset as follows:

# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex]    # .N= # obs in by group

The result of the last command is:

      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843

Even more interesting is

d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]

which gives

      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216

The results are exactly what I want, but the syntax depends very strongly on the capabilities of data.table. How can I get the same results using a data frame instead of a data.table, ideally with base R, but also with dplyr?

Sincerely

Thomas Philips

Answer 1

Concerning dplyr, the most straightforward way to achieve this is probably the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class can be done as follows:

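library(dplyr)  # provides %>%, group_by() and summarize()

# mn() and Nna() are the helper functions defined in the question
d %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mn(Survived),
    N = Nna(Survived)
  )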

# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups:   Sex [2]
  Sex    Class Proportion     N
  <chr>  <dbl>      <dbl> <int>
1 female     1      0.965   144
2 female     2      0.887   106
3 female     3      0.491   216
4 male       1      0.344   180
5 male       2      0.141   170
6 male       3      0.152   493
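
The message about grouped output can be avoided by setting `.groups` explicitly. Here is a minimal sketch of that, assuming a small made-up data frame `toy` (hypothetical values, not the real Titanic file) and with the two helper functions written inline so the example runs on its own:

```r
library(dplyr)

# Toy stand-in for the Titanic data (hypothetical values for illustration)
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male"),
  Class    = c(1, 1, 1, 2, 2),
  Survived = c(1, 0, 0, 1, NA)
)

res <- toy %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mean(Survived, na.rm = TRUE),  # same as mn()
    N          = sum(!is.na(Survived)),         # same as Nna()
    .groups    = "drop"                         # ungrouped result, no message
  )

res
```

With `.groups = "drop"` the result is a plain, ungrouped tibble, which tends to be the safer choice if further steps should not silently operate per `Sex`.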

And just summarizing by Sex:

d %>%  
  group_by(Sex) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
# A tibble: 2 × 3
  Sex    Proportion     N
  <chr>       <dbl> <int>
1 female      0.727   466
2 male        0.191   843

Finally, here is also a solution using stats, which I assume is close enough to "base R". It's not the most elegant solution (i.e., one could probably assign the variable names and functions in a single step), but it works well:

# Formula interface: summarize Survived within each Sex/Class combination
summary_mn <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = mn)
summary_nna <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = Nna)

# Join the two summaries on the grouping columns, then give the columns readable names
summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")

summary_full

# Output
     Sex Class Proportion   N
1 female     1  0.9652778 144
2 female     2  0.8867925 106
3 female     3  0.4907407 216
4   male     1  0.3444444 180
5   male     2  0.1411765 170
6   male     3  0.1521298 493
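
If the two aggregate() calls plus merge() feel clumsy, another base R route is to split the data frame by group and build one summary row per piece. The following is only a sketch under assumptions: `toy` is a small made-up data frame (not the real Titanic data) and `summary_split` is a name chosen for illustration.

```r
# Toy stand-in for the Titanic data (hypothetical values for illustration)
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male"),
  Class    = c(1, 1, 1, 2, 2),
  Survived = c(1, 0, 0, 1, NA)
)

# Split the rows by every observed Sex/Class combination;
# drop = TRUE discards combinations that do not occur in the data
groups <- split(toy, list(toy$Sex, toy$Class), drop = TRUE)

# Build one summary row per group, then stack the rows
rows <- lapply(groups, function(g) {
  data.frame(
    Sex        = g$Sex[1],
    Class      = g$Class[1],
    Proportion = mean(g$Survived, na.rm = TRUE),
    N          = sum(!is.na(g$Survived))
  )
})
summary_split <- do.call(rbind, rows)
rownames(summary_split) <- NULL

summary_split
```

One difference worth keeping in mind: the formula interface of aggregate() removes rows with NA before grouping, whereas this version keeps every group and handles NA inside the summary functions, so a group consisting only of NA values would still appear (with N equal to 0).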

Answer 2

With base R

# \(x) is the R >= 4.1 shorthand for function(x)
aggregate(Survived ~ Sex, d,
          \(x) c(Proportion = mean(x), N = length(x)))

aggregate(Survived ~ Sex + Class, d,
          \(x) c(Proportion = mean(x), N = length(x)))

Output

#> first

     Sex Survived.Proportion Survived.N
1 female           0.7274678        466
2   male           0.1909846        843

#> second

     Sex Class Survived.Proportion Survived.N
1 female     1           0.9652778        144
2   male     1           0.3444444        180
3 female     2           0.8867925        106
4   male     2           0.1411765        170
5 female     3           0.4907407        216
6   male     3           0.1521298        493
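
One thing to be aware of with this approach: when the function returns a vector, aggregate() stores the results as a single matrix column named Survived, not as two separate columns. A minimal sketch of flattening it, assuming a made-up data frame `toy` (hypothetical values, not the real Titanic data) and illustrative names `res` and `flat`:

```r
# Toy stand-in for the Titanic data (hypothetical values for illustration)
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male"),
  Class    = c(1, 1, 1, 2, 2),
  Survived = c(1, 0, 0, 1, NA)
)

# function(x) is used here so the sketch also runs on R versions before 4.1
res <- aggregate(Survived ~ Sex + Class, toy,
                 function(x) c(Proportion = mean(x), N = length(x)))

# 'Survived' is one matrix column, so res has 3 columns, not 4
ncol(res)

# do.call(data.frame, ...) expands the matrix into ordinary columns
flat <- do.call(data.frame, res)
names(flat) <- c("Sex", "Class", "Proportion", "N")

flat
```

After flattening, `flat$Proportion` and `flat$N` can be used like any other column. Note that the formula interface drops the row with the missing Survived value before the function is applied, which is why mean() and length() work here without any NA handling.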
