
Summary statistics with grouping by multiple columns: dataframe vs. data.table vs. dplyr

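I am working through the Titanic case study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv. He uses data.table to summarize the dataset as follows: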

# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex]    # .N= # obs in by group

The result of the last command is:

      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843

More interesting is

d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]

which gives

      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216

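The results are exactly what I want, but the syntax relies heavily on features specific to data.table. How can I get the same results with a dataframe instead of a data.table, ideally using base R, but also using dplyr?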

Sincerely,

Thomas Philips

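With dplyr, the most straightforward way to achieve this is probably the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class can be done as follows: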

d %>%  
  group_by(Sex, Class) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups:   Sex [2]
  Sex    Class Proportion     N
  <chr>  <dbl>      <dbl> <int>
1 female     1      0.965   144
2 female     2      0.887   106
3 female     3      0.491   216
4 male       1      0.344   180
5 male       2      0.141   170
6 male       3      0.152   493

And summarizing by Sex alone:

d %>%  
  group_by(Sex) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
# A tibble: 2 × 3
  Sex    Proportion     N
  <chr>       <dbl> <int>
1 female      0.727   466
2 male        0.191   843
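
The `summarise()` message in the two-column output above refers to the `.groups` argument. As a minimal sketch of what that looks like, assuming dplyr is installed and using a small made-up data frame (`toy` is hypothetical and stands in for `d`), passing `.groups = "drop"` should return an ungrouped tibble and suppress the message:

```r
library(dplyr)

# Hypothetical toy data standing in for the Titanic data frame `d`
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Class    = c(1, 1, 1, 2, 2, 2),
  Survived = c(1, 1, 0, 0, 1, 0)
)

# Same helpers as in the question
mn  <- function(x) mean(x, na.rm = TRUE)
Nna <- function(x) sum(!is.na(x))

out <- toy %>%
  group_by(Sex, Class) %>%
  summarize(
    Proportion = mn(Survived),
    N = Nna(Survived),
    .groups = "drop"  # drop all grouping from the result
  )

out
```

Whether to drop the grouping depends on what happens next; if further per-Sex computations follow, keeping the default grouping may be more convenient.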

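Finally, here is also a solution using stats, which I think is close enough to "base R" (?). It is not the most elegant solution (it might be possible to assign the variable names and functions in one step), but it works well: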

summary_mn <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) mn(x))
summary_nna <- stats::aggregate(Survived ~ Sex + Class, data = as.data.frame(d), FUN = function(x) Nna(x))

summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")

summary_full

# Output
     Sex Class Proportion   N
1 female     1  0.9652778 144
2 female     2  0.8867925 106
3 female     3  0.4907407 216
4   male     1  0.3444444 180
5   male     2  0.1411765 170
6   male     3  0.1521298 493

  • base R
aggregate(Survived ~ Sex , d ,
          \(x) c(Proportion = mean(x) , N = length(x)))

aggregate(Survived ~ Sex + Class, d , 
          \(x) c(Proportion = mean(x) , N = length(x)))
  • Output
#> first

     Sex Survived.Proportion Survived.N
1 female           0.7274678        466
2   male           0.1909846        843

#> second

     Sex Class Survived.Proportion Survived.N
1 female     1           0.9652778        144
2   male     1           0.3444444        180
3 female     2           0.8867925        106
4   male     2           0.1411765        170
5 female     3           0.4907407        216
6   male     3           0.1521298        493
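
One thing to be aware of with this approach: when the function passed to `aggregate()` returns a named vector, the result holds both statistics in a single matrix column called `Survived`, which is why the headers print as `Survived.Proportion` and `Survived.N`. A minimal sketch of flattening that into ordinary columns, using a small made-up data frame (`toy` is hypothetical and stands in for `d`):

```r
# Hypothetical toy data standing in for the Titanic data frame `d`
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Class    = c(1, 1, 1, 2, 2, 2),
  Survived = c(1, 1, 0, 0, 1, 0)
)

# One aggregate() call that returns two statistics per group
res <- aggregate(Survived ~ Sex + Class, data = toy,
                 FUN = function(x) c(Proportion = mean(x), N = length(x)))

# `Survived` is a two-column matrix stored inside the data frame
is.matrix(res$Survived)

# Bind the grouping columns to the matrix to get four plain columns
flat <- cbind(res[c("Sex", "Class")], res$Survived)
names(flat)
```

After flattening, `flat$Proportion` and `flat$N` can be used like any other column, which is probably closer to the data.table output in the question.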
