dplyr: summarize long format for multiple groups
I know there are plenty of questions that may sound similar to this one, but I couldn't find an answer to my exact problem.
Let's say we have a toy dataset
library(tidyverse)
df <- tibble(
Gender = c("m", "f", "f", "m", "m",
"f", "f", "f", "m", "f"),
IQ = rnorm(10, 100, 15),
Other = runif(10),
Test = rnorm(10),
group2 = c("A", "A", "A", "A", "A",
"B", "B", "B", "B", "B")
)
From it we want to calculate the mean, min and max by Gender and group2.
For just one group, I can easily write
df %>%
group_by(Gender) %>%
select_if(is.numeric) %>%
gather(Variable, Value, -Gender) %>%
group_by(Variable, Gender) %>%
summarise(mean = mean(Value),
min = min(Value),
max = max(Value)) %>%
ungroup()
to get
Variable Gender mean min max
<chr> <chr> <dbl> <dbl> <dbl>
1 IQ f 99.2 81.9 121.
2 IQ m 89.0 62.5 106.
3 Other f 0.301 0.187 0.479
4 Other m 0.395 0.0483 0.757
5 Test f -0.0770 -1.18 0.545
6 Test m 0.163 -0.632 0.828
But I can't figure out how to do the same for multiple groups. I know I can use summarise_*() like this
df %>%
group_by(Gender) %>%
summarise_if(is.numeric, list(mean = mean,
min = min,
max = max))
but it returns a wide format (like data.table)
Gender IQ_mean Other_mean Test_mean IQ_min Other_min Test_min IQ_max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f 99.2 0.301 -0.0770 81.9 0.187 -1.18 121.
2 m 89.0 0.395 0.163 62.5 0.0483 -0.632 106.
# … with 2 more variables: Other_max <dbl>, Test_max <dbl>
which is nearly useless once you have more than 10 variables.
What am I missing here?
You can do this by adding gather, separate and spread to your own code:
df %>%
group_by(Gender, group2) %>%
summarise_if(is.numeric, list(mean = mean,
min = min,
max = max)) %>%
gather(vars, vals, -Gender, -group2) %>%
separate(vars, c("Variable", "stat")) %>%
spread(stat, vals)
#### OUTPUT ####
# A tibble: 12 x 6
# Groups: Gender [2]
Gender group2 Variable max mean min
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 f A IQ 110. 103. 95.0
2 f A Other 0.934 0.469 0.00439
3 f A Test 1.39 0.472 -0.446
4 f B IQ 121. 92.0 75.6
5 f B Other 0.730 0.461 0.261
6 f B Test 0.589 0.276 -0.524
7 m A IQ 112. 104. 94.3
8 m A Other 0.827 0.613 0.308
9 m A Test 0.724 0.136 -0.264
10 m B IQ 115. 115. 115.
11 m B Other 0.970 0.970 0.970
12 m B Test -1.05 -1.05 -1.05
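One thing to watch in the summarise-then-reshape route: by default separate() splits on any run of non-alphanumeric characters, so a column whose own name contains an underscore (say a hypothetical IQ_score) would not split cleanly into "Variable" and "stat". A minimal sketch of the same idea, assuming dplyr >= 1.0 and tidyr >= 1.0 are available, uses across() and pivot_longer() with a pattern that splits on the last underscore only. The seed and the object name res_wide_to_long are illustrative, not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

res_wide_to_long <- df %>%
  group_by(Gender, group2) %>%
  # across() names the results "<column>_<function>", e.g. IQ_mean
  summarise(across(where(is.numeric),
                   list(mean = mean, min = min, max = max)),
            .groups = "drop") %>%
  # the greedy first group makes the split happen at the LAST underscore;
  # ".value" turns the second piece (mean/min/max) into column names
  pivot_longer(-c(Gender, group2),
               names_to = c("Variable", ".value"),
               names_pattern = "(.*)_(.*)")

res_wide_to_long
```

This should give one row per Gender/group2/Variable combination with mean, min and max as columns, without a separate spread() step.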
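You can first convert df to long format by gathering IQ, Other and Test into a single variable column, and then compute the summary statistics for each group (Gender-group2-variable):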
library(tidyverse)
set.seed(1)
## data
df <- tibble(
Gender = c("m", "f", "f", "m", "m",
"f", "f", "f", "m", "f"),
IQ = rnorm(10, 100, 15),
Other = runif(10),
Test = rnorm(10),
group2 = c("A", "A", "A", "A", "A",
"B", "B", "B", "B", "B")
)
df %>%
gather(key = "variable", value = "value", -c(Gender, group2)) %>%
group_by(Gender, group2, variable) %>%
summarize_at("value", list(mean = mean, min = min, max = max)) %>%
ungroup()
#> # A tibble: 12 x 6
#> Gender group2 variable mean min max
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 f A IQ 95.1 87.5 103.
#> 2 f A Other 0.432 0.212 0.652
#> 3 f A Test 0.464 -0.0162 0.944
#> 4 f B IQ 100. 87.7 111.
#> 5 f B Other 0.281 0.0134 0.386
#> 6 f B Test 0.599 0.0746 0.919
#> 7 m A IQ 106. 90.6 124.
#> 8 m A Other 0.442 0.126 0.935
#> 9 m A Test 0.457 -0.0449 0.821
#> 10 m B IQ 109. 109. 109.
#> 11 m B Other 0.870 0.870 0.870
#> 12 m B Test -1.99 -1.99 -1.99
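For readers on newer package versions, the same long-first approach can be sketched with pivot_longer() and a plain summarise(), since gather() and the summarize_at() family are superseded in current tidyr/dplyr. This assumes dplyr >= 1.0 and tidyr >= 1.0; the object name res_long is only for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

res_long <- df %>%
  # stack the three measurement columns into variable/value pairs
  pivot_longer(c(IQ, Other, Test),
               names_to = "variable", values_to = "value") %>%
  group_by(Gender, group2, variable) %>%
  # .groups = "drop" returns an ungrouped tibble, replacing ungroup()
  summarise(mean = mean(value),
            min  = min(value),
            max  = max(value),
            .groups = "drop")

res_long
```

Because the reshaping happens before the summary, no column-name parsing is needed afterwards.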
Here is a data.table approach
library( data.table )
melt( setDT(df),
id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE)),
by = .(Gender, group2, variable )][]
# Gender group2 variable max min mean
# 1: m A IQ 120.739562935 83.46037366 96.99412720
# 2: f A IQ 98.657598754 98.43677811 98.54718843
# 3: f B IQ 111.973534436 71.38605822 94.04719457
# 4: m B IQ 102.913093964 102.91309396 102.91309396
# 5: m A Other 0.861929066 0.51651983 0.66098944
# 6: f A Other 0.752484881 0.07648229 0.41448359
# 7: f B Other 0.463524836 0.18308752 0.33301693
# 8: m B Other 0.099740011 0.09974001 0.09974001
# 9: m A Test 1.159379020 -0.83569116 0.04268551
# 10: f A Test -0.009017293 -0.77245300 -0.39073515
# 11: f B Test 1.591132150 -0.99248570 -0.24997246
# 12: m B Test 1.654489766 1.65448977 1.65448977
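Note that setDT() converts its argument to a data.table by reference, so after the call above df itself is a data.table. If the original object should stay untouched, a small variation (a sketch, with the names DT, long and res chosen for illustration) is to work on a copy made with as.data.table():

```r
library(data.table)

set.seed(1)
df <- data.frame(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

DT <- as.data.table(df)  # makes a copy; df remains a plain data.frame

# melt to long format: one row per (Gender, group2, variable) observation
long <- melt(DT, id.vars = c("Gender", "group2"))

# summarise the stacked values within each group
res <- long[, .(mean = mean(value, na.rm = TRUE),
                min  = min(value, na.rm = TRUE),
                max  = max(value, na.rm = TRUE)),
            by = .(Gender, group2, variable)]

res
```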
Benchmarks (the code that produced these timings follows):
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.498788 1.819936 1.997320 1.980358 2.218809 2.413124 10
# tidyverse1 11.263956 11.887270 12.421442 11.963340 12.484075 15.401816 10
# tidyverse2 4.952477 5.185053 6.303103 6.001478 6.902558 9.663341 10
microbenchmark::microbenchmark(
data.table = {
DT <- copy(df)
melt( setDT(DT),
id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE)),
by = .(Gender, group2, variable )][]
},
tidyverse1 = {
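DT <- copy(df)  # note: DT is never used below (the pipeline runs on df), so this copy only adds overhead to this timing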
df %>%
group_by(Gender, group2) %>%
summarise_if(is.numeric, list(mean = mean,
min = min,
max = max)) %>%
gather(vars, vals, -Gender, -group2) %>%
separate(vars, c("Variable", "stat")) %>%
spread(stat, vals)
},
tidyverse2 = {
df %>%
gather(key = "variable", value = "value", -c(Gender, group2)) %>%
group_by(Gender, group2, variable) %>%
summarize_at("value", list(mean = mean, min = min, max = max)) %>%
ungroup()
},
times = 10
)