Count occurrences of factors across multiple columns in grouped dataframe
I have the following data frame and want to group by the grp column to see how many times each column value appears in each group.
> (df <- data.frame(grp = unlist(strsplit("aabbccca", "")), col1=unlist(strsplit("ABAABBAB", "")), col2=unlist(strsplit("BBCCCCDD", ""))))
grp col1 col2
1 a A B
2 a B B
3 b A C
4 b A C
5 c B C
6 c B C
7 c A D
8 a B D
Desired result:
grp col1A col1B col2B col2C col2D
1 a 1 2 2 0 1
2 b 2 0 0 2 0
3 c 1 2 0 2 1
If I only look at the grp and col1 columns, this is easy to solve using table(), and when there are only 2 columns, I could merge table(df[c('grp', 'col1')]) with table(df[c('grp', 'col2')]). However, this gets extremely cumbersome as the number of factor columns grows, and is problematic if there are shared values between col1 and col2.
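To make the shared-values problem concrete, here is a minimal sketch. It assumes "merge" means binding the two tables side by side with cbind(), which may not be exactly what was tried; the names t1, t2 and combined are only illustrative.

```r
# Same data as in the question; stringsAsFactors is set explicitly
# so the sketch behaves the same on any R version.
df <- data.frame(
  grp  = unlist(strsplit("aabbccca", "")),
  col1 = unlist(strsplit("ABAABBAB", "")),
  col2 = unlist(strsplit("BBCCCCDD", "")),
  stringsAsFactors = FALSE
)

t1 <- table(df[c("grp", "col1")])  # counts of col1 values per group
t2 <- table(df[c("grp", "col2")])  # counts of col2 values per group

# Assumption: the two tables are combined column-wise.
combined <- cbind(t1, t2)

# "B" occurs in both col1 and col2, so two columns end up with the
# same name and can no longer be told apart.
colnames(combined)
```

Prefixing each value with its source column name, as the answers do, avoids the clash.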
Note that dplyr's count doesn't work, as it looks for unique combinations of col1 and col2.
I've tried melting and spreading the data frame using tidyr, without any luck:
> pivot_longer(df, c(col1, col2), names_to= "key", values_to = "val") %>% pivot_wider("grp", names_from = c("key", "val"), values_from = 1, values_fn = sum)
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Column `grp` doesn't exist.
I can find plenty of solutions that work for the case where I have 1 group column and 1 value column, but I can't figure out how to generalize them to more columns.
You can stack col1 and col2 together, count the number of each combination, and then transform the table to wide form.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(col1:col2) %>%
  count(grp, name, value) %>%
  # id_cols is named explicitly; newer tidyr versions deprecate passing it by position
  pivot_wider(id_cols = grp, names_from = c(name, value), names_sort = TRUE,
              values_from = n, values_fill = 0)
# A tibble: 3 x 6
grp col1_A col1_B col2_B col2_C col2_D
<chr> <int> <int> <int> <int> <int>
1 a 1 2 2 0 1
2 b 2 0 0 2 0
3 c 1 2 0 2 1
A base solution (thanks to @GKi for refining the code):
table(cbind(df["grp"], col=do.call(paste0, stack(df[-1])[2:1])))
col
grp col1A col1B col2B col2C col2D
a 1 2 2 0 1
b 2 0 0 2 0
c 1 2 0 2 1
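The one-liner above is compact, so here is a step-by-step sketch of the same idea. The intermediate names long, labels and grp_rep are illustrative only, and the group column is repeated explicitly with rep() instead of relying on recycling.

```r
# Same data as in the question.
df <- data.frame(
  grp  = unlist(strsplit("aabbccca", "")),
  col1 = unlist(strsplit("ABAABBAB", "")),
  col2 = unlist(strsplit("BBCCCCDD", "")),
  stringsAsFactors = FALSE
)

# stack() puts all value columns on top of each other:
# 'values' holds the cell values, 'ind' the originating column name.
long <- stack(df[-1])

# Build labels such as "col1A" by pasting column name and value.
labels <- paste0(long$ind, long$values)

# Repeat the group column once per stacked value column.
grp_rep <- rep(df$grp, times = ncol(df) - 1)

# Cross-tabulate group against the combined label.
counts <- table(grp = grp_rep, col = labels)
counts
```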
Use recast from the reshape2 package:
reshape2::recast(df, grp~variable+value,id.var = 'grp', fun = length)
grp col1_A col1_B col2_B col2_C col2_D
1 a 1 2 2 0 1
2 b 2 0 0 2 0
3 c 1 2 0 2 1
In base R you could do:
with(df, cbind(table(grp, paste0('col1_', col1)), table(grp, paste0('col2_', col2))))
col1_A col1_B col2_B col2_C col2_D
a 1 2 2 0 1
b 2 0 0 2 0
c 1 2 0 2 1
If you have many columns, consider doing:
do.call(cbind, Map(function(x, y) table(df$grp, paste(x,y, sep = '_')),
names(df)[-1], df[,-1]))
col1_A col1_B col2_B col2_C col2_D
a 1 2 2 0 1
b 2 0 0 2 0
c 1 2 0 2 1
You can then turn this into a data frame.
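One possible way to do that conversion is sketched below. The names counts and res are illustrative, and as.data.frame.matrix() is called directly so the result stays in wide form whatever class the combined object carries.

```r
# Same data as in the question.
df <- data.frame(
  grp  = unlist(strsplit("aabbccca", "")),
  col1 = unlist(strsplit("ABAABBAB", "")),
  col2 = unlist(strsplit("BBCCCCDD", "")),
  stringsAsFactors = FALSE
)

# Count matrix built as in the many-columns version above.
counts <- do.call(cbind, Map(function(x, y) table(df$grp, paste(x, y, sep = "_")),
                             names(df)[-1], df[, -1]))

# Convert the matrix to a data frame, keeping one column per label.
res <- as.data.frame.matrix(counts)

# Move the group labels from the row names into a regular column.
res <- cbind(grp = rownames(res), res)
rownames(res) <- NULL
res
```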
You were on the right track with melt and spread. Here's a tidyverse solution. I first use pivot_longer to generalise to an arbitrary number of columns and then pivot_wider to return to the desired output format. The order of columns in the output data frame is data dependent. If this is an issue, simply append a select to the end of the pipe to obtain the desired order. (Or use names_sort as in @DarrenTsai's answer.)
library(tidyverse)
df %>%
  pivot_longer(
    starts_with("col"),
    names_to="Column",
    values_to="Value"
  ) %>%
  group_by(grp, Column, Value) %>%
  summarise(N=n(), .groups="drop") %>%
  group_by(grp) %>%
  pivot_wider(
    id_cols=grp,
    values_from=N,
    names_from=c(Column, Value),
    names_sep="",
    values_fill=0
  ) %>%
  ungroup()
# A tibble: 3 × 6
grp col1A col1B col2B col2D col2C
<chr> <int> <int> <int> <int> <int>
1 a 1 2 2 1 0
2 b 2 0 0 0 2
3 c 1 2 0 1 2
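As a rough illustration of the reordering step, the sketch below rebuilds the output shown above by hand (in a hypothetical variable res) and sorts the count columns alphabetically. It is written in base R to avoid package dependencies; the dplyr equivalent is noted in a comment.

```r
# Hand-built copy of the output above, with col2D before col2C.
res <- data.frame(
  grp   = c("a", "b", "c"),
  col1A = c(1L, 2L, 1L),
  col1B = c(2L, 0L, 2L),
  col2B = c(2L, 0L, 0L),
  col2D = c(1L, 0L, 1L),
  col2C = c(0L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Sort every column except the grouping column by name.
sorted_cols <- sort(setdiff(names(res), "grp"))

# Base R reordering; with dplyr this could be written as
# select(res, grp, all_of(sorted_cols)).
ordered <- res[, c("grp", sorted_cols)]
ordered
```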
Another possible solution, based on tidyr::pivot_longer followed by tidyr::pivot_wider with values_fn = length:
library(tidyverse)
df %>%
  pivot_longer(c(col1, col2)) %>%
  mutate(name = str_c(name, value)) %>%
  # id_cols is named explicitly; newer tidyr versions deprecate passing it by position
  pivot_wider(id_cols = grp, values_fn = length, values_fill = 0, names_sort = TRUE)
#> # A tibble: 3 x 6
#> grp col1A col1B col2B col2C col2D
#> <chr> <int> <int> <int> <int> <int>
#> 1 a 1 2 2 0 1
#> 2 b 2 0 0 2 0
#> 3 c 1 2 0 2 1
In data.table, we can use dcast + melt as shown below:
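library(data.table)
dcast(
  melt(setDT(df), id.vars = "grp")[
    , value := paste(variable, value, sep = "_")
  ],
  grp ~ value,
  fun.aggregate = length  # count rows per cell explicitly instead of relying on the default
)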
to produce
grp col1_A col1_B col2_B col2_C col2_D
1: a 1 2 2 0 1
2: b 2 0 0 2 0
3: c 1 2 0 2 1