Using lapply over a list of data frames with grouping
I have a list containing two data frames.
library(tidyverse)
dat <- list("seniors" = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
COOL = c(0, 1, 1, 0, 1),
GENDER = c("Male", "Female", "Male", "Female", "Female"),
RACE = c("B", "B", "W", "W", "B")),
"juniors" = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
COOL = c(1, 1, 1, 0, 1),
GENDER = c("Male", "Male", "Male", "Female", "Female"),
RACE = c("W", "W", "B", "W", "W")))
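If I want counts for a particular grouping variable, e.g. GENDER, in both data frames, grouped by whether the individual is COOL, I can use the following code: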
results <- lapply(names(dat), function(x) {
dat[[x]] %>%
group_by(COOL, GENDER) %>%
summarise(TOTAL = n()) %>%
mutate(COHORT = x) %>%
select(COHORT, everything())
})
do.call(rbind, results)
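However, I'd like to get counts for n grouping variables without repeating the code n times, and to have all the results in one table. Note that although I always want to group by COOL, the second grouping variable is the one that changes.
My desired output is below (note that the TOTAL numbers do not reflect the example data; I mainly want to show the required table structure). I also recognise that this table structure does not follow tidy principles; it just needs to be this way for the final lookup in Excel.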
COHORT COOL GROUP_VAR GROUP_VAL TOTAL
SENIORS 0 GENDER MALE 3
SENIORS 1 GENDER MALE 5
SENIORS 0 GENDER FEMALE 7
SENIORS 1 GENDER FEMALE 2
SENIORS 0 RACE B 2
SENIORS 1 RACE B 3
SENIORS 0 RACE W 7
SENIORS 1 RACE W 9
JUNIORS 0 GENDER MALE 3
JUNIORS 1 GENDER MALE 5
JUNIORS 0 GENDER FEMALE 3
JUNIORS 1 GENDER FEMALE 1
JUNIORS 0 RACE B 2
JUNIORS 1 RACE B 7
JUNIORS 0 RACE W 3
JUNIORS 1 RACE W 2
I tried wrapping the result in another lapply over a list of column names (see below), but that does not work:
group_names <- list("GENDER", "RACE")
lapply(names(dat), function(x) {
lapply(names(group_names), function (y) {
dat[[x]] %>%
group_by(COOL, y) %>%
summarise(TOTAL = n()) %>%
mutate(COHORT = x,
GROUP = y) %>%
select(COHORT, everything())
})
})
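Does anyone know how I could do this in an elegant and efficient way?
Thanks!
You can use tibble::enframe() to convert the list of data frames into a single data frame, and then apply the grouping to that. The variable names you pass to dplyr::count() specify the grouping variables: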
library(dplyr)
library(tidyr)
library(tibble)
dat %>%
enframe("COHORT", "data") %>%
unnest(data) %>%
count(COHORT, COOL, GENDER, name="TOTAL")
# A tibble: 7 x 4
COHORT COOL GENDER TOTAL
<chr> <dbl> <fct> <int>
1 juniors 0 Female 1
2 juniors 1 Female 1
3 juniors 1 Male 3
4 seniors 0 Female 1
5 seniors 0 Male 1
6 seniors 1 Female 2
7 seniors 1 Male 1
Does this answer your question?
==========================================
Based on @DJC's comment, here is a more suitable solution:
dat %>%
enframe("COHORT", "data") %>%
unnest(data) %>%
gather(GROUP_VAR, GROUP_VAL, GENDER, RACE) %>%
count(COHORT, COOL, GROUP_VAR, GROUP_VAL, name="TOTAL")
# A tibble: 14 x 5
COHORT COOL GROUP_VAR GROUP_VAL TOTAL
<chr> <dbl> <chr> <chr> <int>
1 juniors 0 GENDER Female 1
2 juniors 0 RACE W 1
3 juniors 1 GENDER Female 1
4 juniors 1 GENDER Male 3
5 juniors 1 RACE B 1
6 juniors 1 RACE W 3
7 seniors 0 GENDER Female 1
8 seniors 0 GENDER Male 1
9 seniors 0 RACE B 1
10 seniors 0 RACE W 1
11 seniors 1 GENDER Female 2
12 seniors 1 GENDER Male 1
13 seniors 1 RACE B 2
14 seniors 1 RACE W 1
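The tidyr documentation marks gather() as superseded in favour of pivot_longer(), so the same reshaping can also be sketched as below. This is only a minimal sketch: the group_vars vector is a helper name introduced here for illustration, and the as.character() step is an assumption that only matters when the grouping columns are factors (as in R versions before 4.0), so that they stack cleanly into one column.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same example data as in the question
dat <- list(
  seniors = data.frame(
    NAME   = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
    COOL   = c(0, 1, 1, 0, 1),
    GENDER = c("Male", "Female", "Male", "Female", "Female"),
    RACE   = c("B", "B", "W", "W", "B")
  ),
  juniors = data.frame(
    NAME   = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
    COOL   = c(1, 1, 1, 0, 1),
    GENDER = c("Male", "Male", "Male", "Female", "Female"),
    RACE   = c("W", "W", "B", "W", "W")
  )
)

# Hypothetical helper: the columns to use as the second grouping variable
group_vars <- c("GENDER", "RACE")

out <- dat %>%
  enframe("COHORT", "data") %>%                       # one row per list element
  unnest(data) %>%                                    # expand to one row per person
  mutate(across(all_of(group_vars), as.character)) %>% # avoid mixing factor columns
  pivot_longer(cols = all_of(group_vars),
               names_to = "GROUP_VAR",
               values_to = "GROUP_VAL") %>%
  count(COHORT, COOL, GROUP_VAR, GROUP_VAL, name = "TOTAL")

print(out)
```

Adding another grouping variable then only requires appending its column name to group_vars.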
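Here are two ways to solve the problem:
1) This is similar to your attempt: do all the operations within each list element and bind the rows at the end. We use imap here, which passes both the data and the name of each list element.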
library(tidyverse)
imap_dfr(dat, ~.x %>%
pivot_longer(cols = c(GENDER, RACE)) %>%
count(COOL, name, value) %>%
mutate(COHORT = .y) %>%
select(COHORT, everything()))
# COHORT COOL name value n
# <chr> <dbl> <chr> <fct> <int>
# 1 seniors 0 GENDER Female 1
# 2 seniors 0 GENDER Male 1
# 3 seniors 0 RACE B 1
# 4 seniors 0 RACE W 1
# 5 seniors 1 GENDER Female 2
# 6 seniors 1 GENDER Male 1
# 7 seniors 1 RACE B 2
# 8 seniors 1 RACE W 1
# 9 juniors 0 GENDER Female 1
#10 juniors 0 RACE W 1
#11 juniors 1 GENDER Female 1
#12 juniors 1 GENDER Male 3
#13 juniors 1 RACE B 1
#14 juniors 1 RACE W 3
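2) To me this is a better approach than 1), because here we combine the rows of all the list elements first and perform the operations only once, on the complete data frame. It is also shorter.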
bind_rows(dat, .id = "COHORT") %>%
pivot_longer(cols = c(GENDER, RACE)) %>%
count(COHORT, COOL, name, value)
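For completeness, the nested lapply attempt from the question can also be made to work. Two things stop it as written: group_names is an unnamed list, so names(group_names) is NULL and the inner lapply has nothing to iterate over, and group_by(COOL, y) does not look up the column whose name is stored in the string y. A minimal sketch of one possible fix follows, using the .data pronoun to refer to a column by name; the output column names GROUP_VAR and GROUP_VAL are taken from the desired table in the question, and the as.character() call is an assumption added so that GENDER and RACE values can share one column.

```r
library(dplyr)

# Same example data as in the question
dat <- list(
  seniors = data.frame(
    NAME   = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
    COOL   = c(0, 1, 1, 0, 1),
    GENDER = c("Male", "Female", "Male", "Female", "Female"),
    RACE   = c("B", "B", "W", "W", "B")
  ),
  juniors = data.frame(
    NAME   = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
    COOL   = c(1, 1, 1, 0, 1),
    GENDER = c("Male", "Male", "Male", "Female", "Female"),
    RACE   = c("W", "W", "B", "W", "W")
  )
)

# A plain character vector; iterate over its values, not its (missing) names
group_names <- c("GENDER", "RACE")

results <- lapply(names(dat), function(x) {
  lapply(group_names, function(y) {
    dat[[x]] %>%
      # .data[[y]] selects the column whose name is held in the string y
      count(COOL, GROUP_VAL = as.character(.data[[y]]), name = "TOTAL") %>%
      mutate(COHORT = x, GROUP_VAR = y) %>%
      select(COHORT, COOL, GROUP_VAR, GROUP_VAL, TOTAL)
  }) %>%
    bind_rows()   # stack the per-variable tables for this cohort
}) %>%
  bind_rows()     # stack the per-cohort tables

print(results)
```

This returns the same counts as approach 2), only with more code, so it is mainly useful if the per-cohort loop structure has to be kept.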