
Count occurrences of factors across multiple columns in grouped dataframe

I have the following dataframe and want to group by the grp column to see how many times each column value appears in each group.

> df <- data.frame(grp = unlist(strsplit("aabbccca", "")), col1 = unlist(strsplit("ABAABBAB", "")), col2 = unlist(strsplit("BBCCCCDD", "")))
> df
  grp col1 col2
1   a    A    B
2   a    B    B
3   b    A    C
4   b    A    C
5   c    B    C
6   c    B    C
7   c    A    D
8   a    B    D

Desired result:

  grp col1A col1B col2B col2C col2D
1   a    1    2     2     0     1
2   b    2    0     0     2     0
3   c    1    2     0     2     1

If I only look at the grp and col1 columns, this is easy to solve using table(), and when there are only 2 columns, I could merge table(df[c('grp', 'col1')]) with table(df[c('grp', 'col2')]). However, this gets extremely cumbersome as the number of factor columns grows, and is problematic if there are shared values between col1 and col2.
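To make the shared-value problem concrete, here is a minimal sketch of that two-table approach on the example data. The names t1, t2, and merged are only illustrative.

```r
# Rebuild the example data from the question.
df <- data.frame(
  grp  = unlist(strsplit("aabbccca", "")),
  col1 = unlist(strsplit("ABAABBAB", "")),
  col2 = unlist(strsplit("BBCCCCDD", ""))
)

# One contingency table per value column.
t1 <- table(df[c("grp", "col1")])
t2 <- table(df[c("grp", "col2")])

# Binding the tables side by side keeps only the raw values as column names,
# so the "B" from col1 and the "B" from col2 can no longer be told apart.
merged <- cbind(t1, t2)
colnames(merged)
```

The column names of merged contain "B" twice, which is why the source column has to be encoded in the name.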

Note that dplyr's count doesn't work, as it looks for unique combinations of col1 and col2.

I've tried melting and spreading the dataframe using tidyr, without any luck:

> pivot_longer(df, c(col1, col2), names_to= "key", values_to = "val") %>% pivot_wider("grp", names_from = c("key", "val"), values_from = 1, values_fn = sum)
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Column `grp` doesn't exist.

I can find plenty of solutions that work for the case where I have 1 group column and 1 value column, but I can't figure out how to generalize them to more columns.

You can stack col1 and col2 together, count the number of each combination, and then transform the table to wide form.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(col1:col2) %>%
  count(grp, name, value) %>%
  pivot_wider(id_cols = grp, names_from = c(name, value), names_sort = TRUE,
              values_from = n, values_fill = 0)

# A tibble: 3 x 6
  grp   col1_A col1_B col2_B col2_C col2_D
  <chr>  <int>  <int>  <int>  <int>  <int>
1 a          1      2      2      0      1
2 b          2      0      0      2      0
3 c          1      2      0      2      1

A base solution (thanks to @GKi for refining the code):

table(cbind(df["grp"], col=do.call(paste0, stack(df[-1])[2:1])))

   col
grp col1A col1B col2B col2C col2D
  a     1     2     2     0     1
  b     2     0     0     2     0
  c     1     2     0     2     1
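For readers unfamiliar with stack(), the one-liner above can be unpacked roughly as follows. This is only a sketch of the same idea, not the answerer's code: the intermediate names long, key, grp_rep, and tab are mine, and it assumes the value columns are character vectors, as in the example.

```r
# Rebuild the example data from the question.
df <- data.frame(
  grp  = unlist(strsplit("aabbccca", "")),
  col1 = unlist(strsplit("ABAABBAB", "")),
  col2 = unlist(strsplit("BBCCCCDD", ""))
)

# stack() puts every value column into one long column `values`
# and records the source column name in `ind`.
long <- stack(df[-1])

# Build labels such as "col1A" by pasting column name and value together.
key <- paste0(long$ind, long$values)

# Repeat grp once per stacked column so it lines up with the long data.
grp_rep <- rep(df$grp, times = ncol(df) - 1)

# Cross-tabulate group against the combined label.
tab <- table(grp = grp_rep, col = key)
tab
```

Because the column name is part of each label, values shared between col1 and col2 stay in separate columns.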

Use recast from the reshape2 package:

reshape2::recast(df, grp ~ variable + value, id.var = 'grp', fun.aggregate = length)

  grp col1_A col1_B col2_B col2_C col2_D
1   a      1      2      2      0      1
2   b      2      0      0      2      0
3   c      1      2      0      2      1

In base R you could do:

with(df, cbind(table(grp, paste0('col1_', col1)), table(grp, paste0('col2_', col2))))

  col1_A col1_B col2_B col2_C col2_D
a      1      2      2      0      1
b      2      0      0      2      0
c      1      2      0      2      1

If you have many columns, consider doing:

do.call(cbind, Map(function(x, y) table(df$grp, paste(x,y, sep = '_')),
                        names(df)[-1], df[,-1]))

  col1_A col1_B col2_B col2_C col2_D
a      1      2      2      0      1
b      2      0      0      2      0
c      1      2      0      2      1

You can then turn this into a data frame.
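One possible way to do that conversion in base R is sketched below. The names counts and res are hypothetical, and the approach assumes the row names of the matrix hold the group labels, as in the output above.

```r
# Rebuild the example data from the question.
df <- data.frame(
  grp  = unlist(strsplit("aabbccca", "")),
  col1 = unlist(strsplit("ABAABBAB", "")),
  col2 = unlist(strsplit("BBCCCCDD", ""))
)

# Count matrix from the answer above: one table per value column, bound by column.
counts <- do.call(cbind, Map(function(x, y) table(df$grp, paste(x, y, sep = "_")),
                             names(df)[-1], df[, -1]))

# Convert the matrix to a data frame, then move the row names into a grp column.
res <- as.data.frame.matrix(counts)
res <- cbind(grp = rownames(res), res)
rownames(res) <- NULL
res
```

The result has grp as an ordinary first column followed by one count column per column/value pair.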

You were on the right track with melt and spread. Here's a tidyverse solution. I first use pivot_longer to generalise to an arbitrary number of columns and then pivot_wider to return to the desired output format. The order of columns in the output data frame is data dependent. If this is an issue, simply append a select to the end of the pipe to obtain the desired order. (Or use names_sort as in @DarrenTsai's answer.)

library(tidyverse)

df %>%
  pivot_longer(
    starts_with("col"),
    names_to="Column",
    values_to="Value"
  ) %>% 
  group_by(grp, Column, Value) %>% 
  summarise(N=n(), .groups="drop") %>% 
  group_by(grp) %>% 
  pivot_wider(
    id_cols=grp,
    values_from=N,
    names_from=c(Column, Value),
    names_sep="",
    values_fill=0
  ) %>%
  ungroup()
# A tibble: 3 × 6
  grp   col1A col1B col2B col2D col2C
  <chr> <int> <int> <int> <int> <int>
1 a         1     2     2     1     0
2 b         2     0     0     0     2
3 c         1     2     0     1     2

Another possible solution, based on tidyr::pivot_longer followed by tidyr::pivot_wider with values_fn = length:

library(tidyverse)

df %>% 
  pivot_longer(c(col1, col2)) %>% 
  mutate(name = str_c(name, value)) %>% 
  pivot_wider(id_cols = grp, values_fn = length, values_fill = 0, names_sort = TRUE)

#> # A tibble: 3 x 6
#>   grp   col1A col1B col2B col2C col2D
#>   <chr> <int> <int> <int> <int> <int>
#> 1 a         1     2     2     0     1
#> 2 b         2     0     0     2     0
#> 3 c         1     2     0     2     1

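In data.table, we can use dcast + melt as shown below:

library(data.table)

# fun.aggregate = length makes the counting explicit instead of relying on the default
dcast(
    melt(setDT(df), id.vars = "grp")[
        , value := paste(variable, value, sep = "_")
    ], grp ~ value, fun.aggregate = length
)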

which produces

   grp col1_A col1_B col2_B col2_C col2_D
1:   a      1      2      2      0      1
2:   b      2      0      0      2      0
3:   c      1      2      0      2      1
