Using lapply over a list of data frames with grouping

I have a list containing two data frames.

library(tidyverse)
dat <- list("seniors" = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                                   COOL = c(0, 1, 1, 0, 1),
                                   GENDER = c("Male", "Female", "Male", "Female", "Female"),
                                   RACE = c("B", "B", "W", "W", "B")), 
            "juniors" = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                                   COOL = c(1, 1, 1, 0, 1),
                                   GENDER = c("Male", "Male", "Male", "Female", "Female"),
                                   RACE = c("W", "W", "B", "W", "W")))

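If I want counts for a particular grouping variable in both data frames, for example GENDER, split by whether the individual is COOL, I can use the following code: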

results <- lapply(names(dat), function(x) {
  dat[[x]] %>% 
    group_by(COOL, GENDER) %>% 
    summarise(TOTAL = n()) %>%
    mutate(COHORT = x) %>% 
    select(COHORT, everything())
})
do.call(rbind, results)

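However, I would like to get counts for n grouping variables without repeating the code n times, and have all the results in a single table. Note that while I always want to group by COOL, the second grouping variable is the one that changes.

My desired output is below (note that the TOTAL numbers do not reflect the sample data; I mainly want to show the table structure I need). Also, I recognize that this table structure does not follow tidy principles; it just needs to be this way for the eventual lookups in Excel.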

COHORT    COOL    GROUP_VAR    GROUP_VAL    TOTAL
SENIORS    0      GENDER       MALE         3
SENIORS    1      GENDER       MALE         5
SENIORS    0      GENDER       FEMALE       7
SENIORS    1      GENDER       FEMALE       2
SENIORS    0      RACE         B            2
SENIORS    1      RACE         B            3
SENIORS    0      RACE         W            7
SENIORS    1      RACE         W            9
JUNIORS    0      GENDER       MALE         3
JUNIORS    1      GENDER       MALE         5
JUNIORS    0      GENDER       FEMALE       3
JUNIORS    1      GENDER       FEMALE       1
JUNIORS    0      RACE         B            2
JUNIORS    1      RACE         B            7
JUNIORS    0      RACE         W            3
JUNIORS    1      RACE         W            2

I tried wrapping the code that builds the results list in another lapply wrapper with a list of column names (see below), but this does not work:

group_names <- list("GENDER", "RACE")
lapply(names(dat), function(x) {
  lapply(names(group_names), function (y) {
      dat[[x]] %>% 
    group_by(COOL, y) %>% 
    summarise(TOTAL = n()) %>%
    mutate(COHORT = x,
           GROUP = y) %>% 
    select(COHORT, everything())
  })
})

Does anyone know how I could do this in an elegant and efficient way?

Thanks!

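You can use tibble::enframe() together with tidyr::unnest() to turn the list of data frames into a single data frame, and then do the grouping on that. The variable names you pass to dplyr::count() determine the grouping variables: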

library(dplyr)
library(tidyr)
library(tibble)

dat %>% 
  enframe("COHORT", "data") %>% 
  unnest(data) %>% 
  count(COHORT, COOL, GENDER, name="TOTAL")


# A tibble: 7 x 4
  COHORT   COOL GENDER TOTAL
  <chr>   <dbl> <fct>  <int>
1 juniors     0 Female     1
2 juniors     1 Female     1
3 juniors     1 Male       3
4 seniors     0 Female     1
5 seniors     0 Male       1
6 seniors     1 Female     2
7 seniors     1 Male       1

Does this answer your question?

==========================================

Based on @DJC's comment, here is a more suitable solution:

dat %>% 
  enframe("COHORT", "data") %>% 
  unnest(data) %>% 
  gather(GROUP_VAR, GROUP_VAL, GENDER, RACE) %>%
  count(COHORT, COOL, GROUP_VAR, GROUP_VAL, name="TOTAL")

# A tibble: 14 x 5
   COHORT   COOL GROUP_VAR GROUP_VAL TOTAL
   <chr>   <dbl> <chr>     <chr>     <int>
 1 juniors     0 GENDER    Female        1
 2 juniors     0 RACE      W             1
 3 juniors     1 GENDER    Female        1
 4 juniors     1 GENDER    Male          3
 5 juniors     1 RACE      B             1
 6 juniors     1 RACE      W             3
 7 seniors     0 GENDER    Female        1
 8 seniors     0 GENDER    Male          1
 9 seniors     0 RACE      B             1
10 seniors     0 RACE      W             1
11 seniors     1 GENDER    Female        2
12 seniors     1 GENDER    Male          1
13 seniors     1 RACE      B             2
14 seniors     1 RACE      W             1
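
Since the second grouping variable is the part that changes, it may help to keep the column names in a character vector rather than typing them into the pipeline. The following is only a sketch of that idea, not part of the original answer: it swaps gather() for pivot_longer() and assumes dplyr 1.0 or later (for across()) and tidyr 1.0 or later. The name group_vars is a placeholder introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same example data as in the question.
dat <- list(
  seniors = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                       COOL = c(0, 1, 1, 0, 1),
                       GENDER = c("Male", "Female", "Male", "Female", "Female"),
                       RACE = c("B", "B", "W", "W", "B")),
  juniors = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                       COOL = c(1, 1, 1, 0, 1),
                       GENDER = c("Male", "Male", "Male", "Female", "Female"),
                       RACE = c("W", "W", "B", "W", "W"))
)

# Hypothetical name: the columns to count by, in addition to COOL.
group_vars <- c("GENDER", "RACE")

result <- dat %>%
  enframe("COHORT", "data") %>%
  unnest(data) %>%
  # Convert to character so columns of different types can share one value column.
  mutate(across(all_of(group_vars), as.character)) %>%
  pivot_longer(all_of(group_vars),
               names_to = "GROUP_VAR", values_to = "GROUP_VAL") %>%
  count(COHORT, COOL, GROUP_VAR, GROUP_VAL, name = "TOTAL")

result
```

Adding another grouping column should then only require extending group_vars, for example c("GENDER", "RACE", "SOME_OTHER_COLUMN"), provided that column exists in every data frame.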

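Here are two ways to solve the problem:

1) This is similar to your attempt: do all the work within each list element, then bind the rows at the end. Here we use imap_dfr, which passes both the data (.x) and the name (.y) of each list element and row-binds the results.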

library(tidyverse)

imap_dfr(dat, ~.x %>%
                 pivot_longer(cols = c(GENDER, RACE)) %>%
                 count(COOL, name, value) %>%
                 mutate(COHORT = .y) %>% 
                 select(COHORT, everything()))

#   COHORT   COOL name   value      n
#   <chr>   <dbl> <chr>  <fct>  <int>
# 1 seniors     0 GENDER Female     1
# 2 seniors     0 GENDER Male       1
# 3 seniors     0 RACE   B          1
# 4 seniors     0 RACE   W          1
# 5 seniors     1 GENDER Female     2
# 6 seniors     1 GENDER Male       1
# 7 seniors     1 RACE   B          2
# 8 seniors     1 RACE   W          1
# 9 juniors     0 GENDER Female     1
#10 juniors     0 RACE   W          1
#11 juniors     1 GENDER Female     1
#12 juniors     1 GENDER Male       3
#13 juniors     1 RACE   B          1
#14 juniors     1 RACE   W          3

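2) To me this is a better approach than 1), because here we combine the rows of all the list elements first and perform the operations only once on the full data frame. It is also shorter.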

bind_rows(dat, .id = "COHORT") %>%
   pivot_longer(cols = c(GENDER, RACE)) %>%
   count(COHORT, COOL, name, value)
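
For completeness, the nested lapply attempt in the question can also be repaired. It returns nothing useful for two reasons: group_names is an unnamed list, so names(group_names) is NULL and the inner lapply has nothing to iterate over; and group_by(COOL, y) does not group by the column whose name is stored in y. Below is a minimal sketch of one possible fix, assuming dplyr 1.0 or later (for the .data pronoun inside group_by and the .groups argument). The names per_var and combined are introduced here for illustration only.

```r
library(dplyr)

# Same example data as in the question.
dat <- list(
  seniors = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                       COOL = c(0, 1, 1, 0, 1),
                       GENDER = c("Male", "Female", "Male", "Female", "Female"),
                       RACE = c("B", "B", "W", "W", "B")),
  juniors = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                       COOL = c(1, 1, 1, 0, 1),
                       GENDER = c("Male", "Male", "Male", "Female", "Female"),
                       RACE = c("W", "W", "B", "W", "W"))
)

# A plain character vector; iterate over its values, not its names.
group_names <- c("GENDER", "RACE")

results <- lapply(names(dat), function(x) {
  per_var <- lapply(group_names, function(y) {
    dat[[x]] %>%
      # .data[[y]] looks up the column whose name is stored in y.
      group_by(COOL, GROUP_VAL = as.character(.data[[y]])) %>%
      summarise(TOTAL = n(), .groups = "drop") %>%
      mutate(COHORT = x, GROUP_VAR = y) %>%
      select(COHORT, COOL, GROUP_VAR, GROUP_VAL, TOTAL)
  })
  bind_rows(per_var)
})

combined <- bind_rows(results)
combined
```

This keeps the structure of the original attempt, but it runs one grouped summary per cohort and per variable, so approach 2) above remains the more compact option.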
