How to summarise by group AND get a summary of the overall dataset using dplyr in R
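I want to compute summaries for different groups and, at the same time, a summary of the whole (ungrouped) dataset, ideally using dplyr (or something that fits well into a dplyr pipeline).
The desired result can be achieved by computing the group summaries and the overall summary separately and then joining the results. However, this seems somewhat inefficient, and I am hoping for a simpler solution that needs less code repetition. I have not found anything on this in the documentation or in other questions.
Reproducible data: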
library(tidyverse)
set.seed(500)
# tibble() replaces the deprecated data_frame()
dat <- tibble(
  treatment = sample(c("Group1", "Group2", "Group3"), 100, replace = TRUE),
  recruitment_strategy = sample(c("Strategy 1", "Strategy 2", "Strategy 3", "Strategy 4", "Strategy 5"), 100, replace = TRUE),
  Variable_A = rnorm(100),
  Variable_B = rnorm(100),
  Variable_C = rnorm(100)
)
Code to compute the means of several variables by group, plus the means for the whole dataset:
# Means of the numeric variables within each treatment group,
# reshaped so that variables are rows and groups are columns
mean_by_group <- dat %>%
  group_by(treatment) %>%
  summarise_if(is.numeric, mean) %>%
  gather(variable, value, -treatment) %>%
  spread(treatment, value)
# Means of the same variables over the whole (ungrouped) dataset
mean_overall <- dat %>%
  summarise_if(is.numeric, mean) %>%
  gather(variable, Overall_dataset)
# Put the overall means next to the group means
left_join(mean_by_group, mean_overall, by = "variable")
The desired output, which the code above produces, is a table of the means for each group with the overall means alongside:
  variable   Group1  Group2  Group3 Overall_dataset
  <chr>       <dbl>   <dbl>   <dbl>           <dbl>
1 Variable_A -0.154  0.0385  0.263           0.0351
2 Variable_B  0.212 -0.232  -0.124          -0.0671
3 Variable_C -0.195  0.194   0.0508          0.0376
A similar procedure for the categorical variable gives the counts and percentages for each group as well as for the whole dataset:
count_by_group <- dat %>%
  group_by(treatment) %>%
  count(recruitment_strategy) %>%
  mutate(`n (%)` = paste0(n, " (", round(n / sum(n)*100,0), "%)")) %>% # calculate percentage in the desired format for table
  select(-n) %>%
  spread(treatment, `n (%)`)
count_overall <- dat %>%
  count(recruitment_strategy) %>%
  mutate(`n (%)` = paste0(n, " (", round(n / sum(n)*100,0), "%)")) %>% # calculate percentage in the desired format for table
  select(-n) %>%
  rename(Overall_dataset = `n (%)`)
left_join(count_by_group, count_overall)
  recruitment_strategy Group1  Group2   Group3  Overall_dataset
  <chr>                <chr>   <chr>    <chr>   <chr>
1 Strategy 1           2 (6%)  13 (30%) 4 (16%) 19 (19%)
2 Strategy 2           8 (26%) 6 (14%)  6 (24%) 20 (20%)
3 Strategy 3           6 (19%) 12 (27%) 3 (12%) 21 (21%)
4 Strategy 4           9 (29%) 4 (9%)   5 (20%) 18 (18%)
5 Strategy 5           6 (19%) 9 (20%)  7 (28%) 22 (22%)
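Is there a solution that gives the grouped summary and the overall summary in one step, instead of assigning two separate objects and then joining them into a third?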
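Here is how I would rewrite your code.
There is a trick with pipes: use the dot (.) to put the LHS in several places on the RHS. This lets you do the join without assigning intermediate objects. I also use a few more steps to strike a different balance between clarity and not repeating yourself, such as doing all the grouping in count() and using its name argument, using mutate_at to do all the formatting after the join, and using str_glue and scales::percent to make the string formatting more readable.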
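As a small illustration of how the dot behaves in a magrittr pipe, here is a sketch on a made-up numeric vector (the name values is hypothetical and not part of the answer). It relies on two documented magrittr rules: when the dot appears only inside nested calls, the LHS is still passed as the first argument as well, and wrapping the RHS in braces switches that implicit first argument off.

```r
library(magrittr)

# Hypothetical toy input, used only for this illustration
values <- c(2, 4, 6, 8)

# `.` appears only inside a nested call, so the pipe also passes `values`
# as the first argument: this evaluates paste(values, sum(values)).
nested <- values %>% paste(sum(.))

# Braces turn off the implicit first argument, so `.` can be placed
# freely and as many times as needed.
share <- values %>% { . / sum(.) }

nested
share
```

Which form is more convenient depends on whether the function being called should also receive the piped object as its first argument.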
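All of these are to some extent a matter of preference, but I think the wish to avoid intermediate assignment (and the burden of having to name those objects) can be met with the approach below.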
library(tidyverse)
set.seed(500)
dat <- tibble(
  treatment = sample(c("Group1", "Group2", "Group3"), 100, replace = TRUE),
  recruitment_strategy = sample(c("Strategy 1", "Strategy 2", "Strategy 3", "Strategy 4", "Strategy 5"), 100, replace = TRUE),
  Variable_A = rnorm(100),
  Variable_B = rnorm(100),
  Variable_C = rnorm(100)
)
dat %>%
  inner_join(
    x = count(., treatment, recruitment_strategy) %>% spread(treatment, n),
    y = count(., recruitment_strategy, name = "Overall_dataset"),
    by = "recruitment_strategy"
  ) %>%
  mutate_at(
    .vars = vars(-recruitment_strategy),
    .funs = ~ str_glue("{.} ({scales::percent(. / sum(.), accuracy = 1)})")
  )
#> # A tibble: 5 x 5
#>   recruitment_strategy Group1  Group2   Group3  Overall_dataset
#>   <chr>                <glue>  <glue>   <glue>  <glue>
#> 1 Strategy 1           2 (6%)  13 (30%) 4 (16%) 19 (19%)
#> 2 Strategy 2           8 (26%) 6 (14%)  6 (24%) 20 (20%)
#> 3 Strategy 3           6 (19%) 12 (27%) 3 (12%) 21 (21%)
#> 4 Strategy 4           9 (29%) 4 (9%)   5 (20%) 18 (18%)
#> 5 Strategy 5           6 (19%) 9 (20%)  7 (28%) 22 (22%)
Created on 2019-11-10 by the reprex package (v0.3.0)
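The answer above covers the count table only. For the table of means from the question, one possible alternative (not the approach used in this answer) is to stack a relabelled copy of the data under the original, so that the overall figures come out as one more group and no join is needed. The sketch below assumes dplyr 1.0 or later for across() and tidyr 1.0 or later for pivot_longer() and pivot_wider(); the "Overall_dataset" label is taken from the question, and the recruitment_strategy column is left out because it is not needed here.

```r
library(dplyr)
library(tidyr)

set.seed(500)
dat <- tibble(
  treatment = sample(c("Group1", "Group2", "Group3"), 100, replace = TRUE),
  Variable_A = rnorm(100),
  Variable_B = rnorm(100),
  Variable_C = rnorm(100)
)

mean_table <- dat %>%
  # `.` is used only inside mutate(), so the pipe also passes `dat` as the
  # first argument: bind_rows(dat, <copy of dat labelled "Overall_dataset">).
  bind_rows(mutate(., treatment = "Overall_dataset")) %>%
  group_by(treatment) %>%
  summarise(across(starts_with("Variable_"), mean)) %>%
  # Reshape so that variables are rows and groups are columns.
  pivot_longer(-treatment, names_to = "variable", values_to = "value") %>%
  pivot_wider(names_from = treatment, values_from = value)

mean_table
```

Every row is present twice after bind_rows(), so only per-group statistics are meaningful from that point on; anything computed across all rows of the stacked data would double count.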