
Counts of combinations of values in a dataframe in R

I have a dataframe like so:

    df<-structure(list(id = c("A", "A", "A", "B", "B", "C", "C", "D", 
"D", "E", "E"), expertise = c("r", "python", "julia", "python", 
"r", "python", "julia", "python", "julia", "r", "julia")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("id", 
"expertise"), spec = structure(list(cols = structure(list(id = structure(list(), class = c("collector_character", 
"collector")), expertise = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("id", "expertise")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

df
   id expertise
1   A         r
2   A    python
3   A     julia
4   B    python
5   B         r
6   C    python
7   C     julia
8   D    python
9   D     julia
10  E         r
11  E     julia

I can get the overall counts of "expertise" by using:

library(dplyr)
df %>% group_by(expertise) %>% mutate(counts_overall = n())

However, what I want is the counts for combinations of expertise values. In other words, how many "id" had the same combination of two expertise values, e.g. "r" and "julia"? Here is a desired output:

df_out<-structure(list(expertise1 = c("r", "r", "python"), expertise2 = c("python", 
"julia", "julia"), count = c(2L, 2L, 3L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("expertise1", 
"expertise2", "count"), spec = structure(list(cols = structure(list(
    expertise1 = structure(list(), class = c("collector_character", 
    "collector")), expertise2 = structure(list(), class = c("collector_character", 
    "collector")), count = structure(list(), class = c("collector_integer", 
    "collector"))), .Names = c("expertise1", "expertise2", "count"
)), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

df_out
  expertise1 expertise2 count
1          r     python     2
2          r      julia     2
3     python      julia     3

The linked answer from latemail's comment creates a matrix

crossprod(table(df) > 0)
         expertise
expertise julia python r
   julia      4      3 2
   python     3      4 2
   r          2      2 3

while the OP expects a dataframe in long format.
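One way to bridge that gap, as a sketch rather than part of the linked answer: take the upper triangle of the crossprod matrix and reshape it into one row per pair. The data frame below re-creates the question's data with plain data.frame(); the names m, idx and long are illustrative only.

```r
# Question data, rebuilt as a plain data.frame (assumption: same values as df above)
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
  expertise = c("r", "python", "julia", "python", "r", "python",
                "julia", "python", "julia", "r", "julia"),
  stringsAsFactors = FALSE
)

# Co-occurrence matrix: cell [i, j] = number of ids having both expertise i and j
m <- crossprod(table(df) > 0)

# Positions above the diagonal, i.e. each unordered pair exactly once
idx <- which(upper.tri(m), arr.ind = TRUE)

# Long format: one row per pair with its count
long <- data.frame(
  expertise1 = rownames(m)[idx[, 1]],
  expertise2 = colnames(m)[idx[, 2]],
  count = m[idx],
  stringsAsFactors = FALSE
)
long
```

The diagonal (how many ids have each single expertise) is dropped on purpose, since the question only asks about pairs.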

1) cross join

Below is a data.table solution which uses the CJ() (cross join) function:

library(data.table)
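# Name the CJ() columns explicitly: newer data.table releases auto-name them
# after the inputs, so V1/V2 would otherwise not exist
setDT(df)[, CJ(V1 = expertise, V2 = expertise)[V1 < V2], by = id][
  , .N, by = .(expertise1 = V1, expertise2 = V2)]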
   expertise1 expertise2 N
1:      julia     python 3
2:      julia          r 2
3:     python          r 2

The CJ(...)[V1 < V2] step keeps each unordered pair once; it is the data.table equivalent of t(combn(df$expertise, 2)) or combinat::combn2(df$expertise).
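To make that equivalence concrete, here is a small base R sketch for a single id. It uses expand.grid() as a stand-in for CJ() (an assumption made so the example needs no packages); x, via_combn and via_grid are illustrative names.

```r
# Expertise values of one id (id "A" in the question's data)
x <- c("r", "python", "julia")

# combn() route: every unordered pair of the sorted values, one pair per row
via_combn <- t(combn(sort(x), 2))

# Cross-join route: all ordered pairs, then keep only those with V1 < V2
grid <- expand.grid(V1 = x, V2 = x, stringsAsFactors = FALSE)
via_grid <- grid[grid$V1 < grid$V2, ]

via_combn
via_grid
```

Both routes yield the same three pairs; only the row order differs.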

2) self-join

Here is another variant which uses a self-join:

library(data.table)
# allow.cartesian = TRUE permits the many-to-many join of df with itself on id
setDT(df)[df, on = "id", allow.cartesian = TRUE][
  expertise < i.expertise, .N, by = .(expertise1 = expertise, expertise2 = i.expertise)]
   expertise1 expertise2 N
1:     python          r 2
2:      julia          r 2
3:      julia     python 3
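The same self-join idea can be sketched in base R with merge() and aggregate(), for readers without data.table. This is an illustrative alternative, not the answer's code; the data frame is rebuilt with data.frame() and the names pairs and counts are assumptions.

```r
# Question data, rebuilt as a plain data.frame (assumption: same values as df above)
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
  expertise = c("r", "python", "julia", "python", "r", "python",
                "julia", "python", "julia", "r", "julia"),
  stringsAsFactors = FALSE
)

# Self-join on id: every ordered pair of expertise values within each id
pairs <- merge(df, df, by = "id")

# Keep each unordered pair once by requiring alphabetical order
pairs <- pairs[pairs$expertise.x < pairs$expertise.y, ]

# Count the ids per pair
counts <- aggregate(id ~ expertise.x + expertise.y, data = pairs, FUN = length)
names(counts) <- c("expertise1", "expertise2", "count")
counts
```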

A solution that is not as efficient as the crossprod/table approach but is easy to understand:

library(dplyr)
library(tidyr)

df %>% group_by(id) %>%
    summarize(expertise = list(combn(sort(expertise), 2, FUN = paste, collapse = '_'))) %>%
    unnest(expertise) %>%
    group_by(expertise) %>%
    summarize(count = n()) %>%
    separate(expertise, c('expertise1', 'expertise2'), sep = '_')

# # A tibble: 3 x 3
#   expertise1 expertise2 count
#   <chr>      <chr>      <int>
# 1 julia      python         3
# 2 julia      r              2
# 3 python     r              2
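One caveat with the combn() route: combn(x, 2) stops with an error when an id has only one expertise value. That case does not occur in the question's data, so the following base R sketch is an assumption-driven extension; pair_counts is a hypothetical helper, and the extra id "F" exists only to exercise the guard.

```r
# Hypothetical helper: count expertise pairs, skipping ids with fewer than two values
pair_counts <- function(df) {
  # Sorted, de-duplicated expertise values per id
  per_id <- lapply(split(df$expertise, df$id), function(x) sort(unique(x)))
  # combn(x, 2) fails when length(x) < 2, so drop those ids first
  per_id <- Filter(function(x) length(x) >= 2, per_id)
  # One row per (id, pair); assumes at least one id survives the filter
  pairs <- do.call(rbind, lapply(per_id, function(x) t(combn(x, 2))))
  out <- as.data.frame(
    table(expertise1 = pairs[, 1], expertise2 = pairs[, 2]),
    responseName = "count", stringsAsFactors = FALSE
  )
  out <- out[out$count > 0, ]  # table() also lists combinations that never occur
  rownames(out) <- NULL
  out
}

# Question data plus one made-up id "F" with a single expertise
df_demo <- data.frame(
  id = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F"),
  expertise = c("r", "python", "julia", "python", "r", "python",
                "julia", "python", "julia", "r", "julia", "r"),
  stringsAsFactors = FALSE
)
result <- pair_counts(df_demo)
result
```

The extra id contributes no pair, so the counts match those of the original data.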
