
dplyr: summarize long format for multiple groups

I know there are many questions that may sound like this one, but I could not find an answer to my exact problem.

Let's say we have a toy dataset

library(tidyverse)
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", 
             "f", "f", "f", "m", "f"),
  IQ = rnorm(10, 100, 15),
  Other = runif(10),
  Test = rnorm(10),
  group2 = c("A", "A", "A", "A", "A",
             "B", "B", "B", "B", "B")
)

from which we want to calculate the mean, min, and max of each numeric variable by Gender and group2.

For only one group, I can easily write

df %>% 
  group_by(Gender) %>% 
  select_if(is.numeric) %>% 
  gather(Variable, Value, -Gender) %>% 
  group_by(Variable, Gender) %>% 
  summarise(mean = mean(Value),
            min = min(Value),
            max = max(Value)) %>%
  ungroup()

to get

 Variable Gender    mean     min     max
 <chr>    <chr>    <dbl>   <dbl>   <dbl>
1 IQ       f      99.2    81.9    121.   
2 IQ       m      89.0    62.5    106.   
3 Other    f       0.301   0.187    0.479
4 Other    m       0.395   0.0483   0.757
5 Test     f      -0.0770 -1.18     0.545
6 Test     m       0.163  -0.632    0.828

But I cannot figure out how to do the same for multiple groups. I know that I can use summarise_*() like this

df %>% 
  group_by(Gender) %>% 
  summarise_if(is.numeric, list(mean = mean, 
                                min = min, 
                                max = max)) 

but it returns a wide format (like data.table):

  Gender IQ_mean Other_mean Test_mean IQ_min Other_min Test_min IQ_max
  <chr>    <dbl>      <dbl>     <dbl>  <dbl>     <dbl>   <dbl>  <dbl>
1 f         99.2      0.301   -0.0770   81.9    0.187   -1.18    121.
2 m         89.0      0.395    0.163    62.5    0.0483  -0.632   106.
# … with 2 more variables: Other_max <dbl>, Test_max <dbl>

which is pretty much useless once you have more than 10 variables.

What am I missing here?

You can achieve this by adding gather, separate, and spread to your own code:

df %>% 
    group_by(Gender, group2) %>% 
    summarise_if(is.numeric, list(mean = mean, 
                                  min = min, 
                                  max = max)) %>% 
    gather(vars, vals, -Gender, -group2) %>% 
    separate(vars, c("Variable", "stat")) %>% 
    spread(stat, vals)

#### OUTPUT ####

# A tibble: 12 x 6
# Groups:   Gender [2]
   Gender group2 Variable     max    mean       min
   <chr>  <chr>  <chr>      <dbl>   <dbl>     <dbl>
 1 f      A      IQ       110.    103.     95.0    
 2 f      A      Other      0.934   0.469   0.00439
 3 f      A      Test       1.39    0.472  -0.446  
 4 f      B      IQ       121.     92.0    75.6    
 5 f      B      Other      0.730   0.461   0.261  
 6 f      B      Test       0.589   0.276  -0.524  
 7 m      A      IQ       112.    104.     94.3    
 8 m      A      Other      0.827   0.613   0.308  
 9 m      A      Test       0.724   0.136  -0.264  
10 m      B      IQ       115.    115.    115.     
11 m      B      Other      0.970   0.970   0.970  
12 m      B      Test      -1.05   -1.05   -1.05   
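
For what it's worth, gather()/spread() and the summarise_if() family are superseded in newer releases. Here is a minimal sketch of the same idea with across() and pivot_longer(), assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed. The set.seed(1) call and the name long_summary are additions for illustration only, not part of the code above:

```r
library(tibble)
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: added only so the sketch is reproducible

df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

long_summary <- df %>%
  group_by(Gender, group2) %>%
  # wide summary: produces columns IQ_mean, IQ_min, IQ_max, Other_mean, ...
  summarise(
    across(where(is.numeric), list(mean = mean, min = min, max = max)),
    .groups = "drop"
  ) %>%
  # split "IQ_mean" into Variable = "IQ" and an output column named "mean"
  pivot_longer(
    -c(Gender, group2),
    names_to  = c("Variable", ".value"),
    names_sep = "_"
  )

long_summary
```

The special ".value" entry tells pivot_longer() that the part of each column name after the "_" names the output column, so no separate spread() step is needed. This sketch assumes the variable names themselves contain no underscore.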

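You can first convert df to long format by gathering IQ, Other, and Test into a single variable column, and then compute the summary statistics for each group (Gender-group2-variable):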

library(tidyverse)

set.seed(1)

## data
df <- tibble(
    Gender = c("m", "f", "f", "m", "m", 
        "f", "f", "f", "m", "f"),
    IQ = rnorm(10, 100, 15),
    Other = runif(10),
    Test = rnorm(10),
    group2 = c("A", "A", "A", "A", "A",
        "B", "B", "B", "B", "B")
)

df %>%
    gather(key = "variable", value = "value", -c(Gender, group2)) %>%
    group_by(Gender, group2, variable) %>%
    summarize_at("value", list(mean = mean, min = min, max = max)) %>%
    ungroup()
#> # A tibble: 12 x 6
#>    Gender group2 variable    mean      min     max
#>    <chr>  <chr>  <chr>      <dbl>    <dbl>   <dbl>
#>  1 f      A      IQ        95.1    87.5    103.   
#>  2 f      A      Other      0.432   0.212    0.652
#>  3 f      A      Test       0.464  -0.0162   0.944
#>  4 f      B      IQ       100.     87.7    111.   
#>  5 f      B      Other      0.281   0.0134   0.386
#>  6 f      B      Test       0.599   0.0746   0.919
#>  7 m      A      IQ       106.     90.6    124.   
#>  8 m      A      Other      0.442   0.126    0.935
#>  9 m      A      Test       0.457  -0.0449   0.821
#> 10 m      B      IQ       109.    109.     109.   
#> 11 m      B      Other      0.870   0.870    0.870
#> 12 m      B      Test      -1.99   -1.99    -1.99
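
The same "reshape first, then summarise" idea can probably also be written with pivot_longer() instead of gather(). This is a rough sketch, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0; the name long_first is made up for illustration:

```r
library(tibble)
library(dplyr)
library(tidyr)

set.seed(1)

df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

long_first <- df %>%
  # stack the three measurement columns into variable/value pairs
  pivot_longer(c(IQ, Other, Test), names_to = "variable", values_to = "value") %>%
  group_by(Gender, group2, variable) %>%
  # one row per Gender-group2-variable combination
  summarise(
    mean = mean(value),
    min  = min(value),
    max  = max(value),
    .groups = "drop"  # return an ungrouped tibble, replacing the explicit ungroup()
  )

long_first
```

Because the data are already long before summarising, no column names need to be split afterwards, so this variant does not depend on how the variables are named.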

Here is a data.table approach

library( data.table )
melt( setDT(df), 
  id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE), 
                                        min = min(value, na.rm = TRUE),
                                        mean = mean(value, na.rm = TRUE)),
                                    by = .(Gender, group2, variable )][]

#    Gender group2 variable           max          min         mean
# 1:      m      A       IQ 120.739562935  83.46037366  96.99412720
# 2:      f      A       IQ  98.657598754  98.43677811  98.54718843
# 3:      f      B       IQ 111.973534436  71.38605822  94.04719457
# 4:      m      B       IQ 102.913093964 102.91309396 102.91309396
# 5:      m      A    Other   0.861929066   0.51651983   0.66098944
# 6:      f      A    Other   0.752484881   0.07648229   0.41448359
# 7:      f      B    Other   0.463524836   0.18308752   0.33301693
# 8:      m      B    Other   0.099740011   0.09974001   0.09974001
# 9:      m      A     Test   1.159379020  -0.83569116   0.04268551
# 10:      f      A     Test  -0.009017293  -0.77245300  -0.39073515
# 11:      f      B     Test   1.591132150  -0.99248570  -0.24997246
# 12:      m      B     Test   1.654489766   1.65448977   1.65448977
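
One thing to keep in mind: setDT(df) converts df to a data.table by reference, so df itself is changed. If that is not wanted, a sketch along the following lines should leave the input untouched, assuming it is acceptable to pay for the copy that as.data.table() makes. The names dt and res are illustrative only:

```r
library(data.table)

set.seed(1)

# plain data.frame input; it stays a data.frame throughout
df <- data.frame(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5),
  stringsAsFactors = FALSE
)

# as.data.table() returns a copy, unlike setDT(), which converts in place
dt <- as.data.table(df)

res <- melt(
  dt,
  id.vars       = c("Gender", "group2"),
  variable.name = "variable",
  value.name    = "value"
)[
  , .(max  = max(value, na.rm = TRUE),
      min  = min(value, na.rm = TRUE),
      mean = mean(value, na.rm = TRUE)),
  by = .(Gender, group2, variable)
]

res
```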

Benchmarks

# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval
# data.table  1.498788  1.819936  1.997320  1.980358  2.218809  2.413124    10
# tidyverse1 11.263956 11.887270 12.421442 11.963340 12.484075 15.401816    10
# tidyverse2  4.952477  5.185053  6.303103  6.001478  6.902558  9.663341    10

microbenchmark::microbenchmark(
  data.table = {
    DT <- copy(df)
    melt( setDT(DT), 
          id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE), 
                                                min = min(value, na.rm = TRUE),
                                                mean = mean(value, na.rm = TRUE)),
                                            by = .(Gender, group2, variable )][]

  },
  tidyverse1 = {
    # NOTE: this copy is not used by the pipeline below, but its cost is included in the timing
    DT <- copy(df)
    df %>% 
      group_by(Gender, group2) %>% 
      summarise_if(is.numeric, list(mean = mean, 
                                    min = min, 
                                    max = max)) %>% 
      gather(vars, vals, -Gender, -group2) %>% 
      separate(vars, c("Variable", "stat")) %>% 
      spread(stat, vals)
  },
  tidyverse2 = {
    df %>%
      gather(key = "variable", value = "value", -c(Gender, group2)) %>%
      group_by(Gender, group2, variable) %>%
      summarize_at("value", list(mean = mean, min = min, max = max)) %>%
      ungroup()
  },
  times = 10 
)
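
Before reading much into the timings, it may be worth checking that the approaches actually return the same numbers. The following is only a sketch of such a check, comparing a tidyverse result with the data.table result; it assumes dplyr >= 1.0.0 and tidyr >= 1.0.0, uses pivot_longer() in place of gather(), and the helper sort_rows is hypothetical:

```r
library(dplyr)
library(tidyr)
library(data.table)

set.seed(1)

df <- data.frame(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5),
  stringsAsFactors = FALSE
)

# tidyverse result
tidy_res <- df %>%
  pivot_longer(c(IQ, Other, Test), names_to = "variable", values_to = "value") %>%
  group_by(Gender, group2, variable) %>%
  summarise(mean = mean(value), min = min(value), max = max(value),
            .groups = "drop") %>%
  as.data.frame()

# data.table result (as.data.table() copies, so df is not modified)
dt_res <- melt(as.data.table(df), id.vars = c("Gender", "group2"))[
  , .(mean = mean(value), min = min(value), max = max(value)),
  by = .(Gender, group2, variable)
]
dt_res <- as.data.frame(dt_res)
dt_res$variable <- as.character(dt_res$variable)  # melt() returns a factor here

# hypothetical helper: bring both results into the same row order
sort_rows <- function(x) x[order(x$Gender, x$group2, x$variable), ]
tidy_res <- sort_rows(tidy_res)
dt_res   <- sort_rows(dt_res)

# keys must match exactly; statistics are compared with a numeric tolerance
keys_match <- identical(tidy_res$Gender, dt_res$Gender) &&
  identical(tidy_res$group2, dt_res$group2) &&
  identical(tidy_res$variable, dt_res$variable)

stats_match <- all(vapply(
  c("mean", "min", "max"),
  function(s) isTRUE(all.equal(tidy_res[[s]], dt_res[[s]])),
  logical(1)
))

keys_match && stats_match
```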
