I have a dataframe like so:
df<-structure(list(id = c("A", "A", "A", "B", "B", "C", "C", "D",
"D", "E", "E"), expertise = c("r", "python", "julia", "python",
"r", "python", "julia", "python", "julia", "r", "julia")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("id",
"expertise"), spec = structure(list(cols = structure(list(id = structure(list(), class = c("collector_character",
"collector")), expertise = structure(list(), class = c("collector_character",
"collector"))), .Names = c("id", "expertise")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
df
id expertise
1 A r
2 A python
3 A julia
4 B python
5 B r
6 C python
7 C julia
8 D python
9 D julia
10 E r
11 E julia
I can get the overall counts of "expertise" by using:
library(dplyr)
df %>% group_by(expertise) %>% mutate (counts_overall= n())
However what I want is the counts for combinations of expertise values. In other words how many "id" had the same combination of two expertise eg "r" and"julia"? Here is a desired output:
df_out<-structure(list(expertise1 = c("r", "r", "python"), expertise2 = c("python",
"julia", "julia"), count = c(2L, 2L, 3L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("expertise1",
"expertise2", "count"), spec = structure(list(cols = structure(list(
expertise1 = structure(list(), class = c("collector_character",
"collector")), expertise2 = structure(list(), class = c("collector_character",
"collector")), count = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("expertise1", "expertise2", "count"
)), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
df_out
expertise1 expertise2 count
1 r python 2
2 r julia 2
3 python julia 3
The linked answer from latemail's comment creates a matrix
crossprod(table(df) > 0)
expertise expertise julia python r julia 4 3 2 python 3 4 2 r 2 2 3
while the OP expects a dataframe in long format.
Below is a data.table
solution which uses the CJ()
( cross join ) function:
library(data.table)
setDT(df)[, CJ(expertise, expertise)[V1 < V2], by = id][
, .N, by = .(expertise1 = V1, expertise2 = V2)]
expertise1 expertise2 N 1: julia python 3 2: julia r 2 3: python r 2
CJ(expertise, expertise)[V1 < V2]
is the data.table
equivalent for t(combn(df$expertise, 2))
or combinat::combn2(df$expertise)
.
Here is another variant which uses a self-join :
library(data.table)
setDT(df)[df, on = "id", allow = TRUE][
expertise < i.expertise, .N, by = .(expertise1 = expertise, expertise2 = i.expertise)]
expertise1 expertise2 N 1: python r 2 2: julia r 2 3: julia python 3
A solution not as efficient as crossprod-table approach but easy to understand:
library(tidyr)
df %>% group_by(id) %>%
summarize(expertise = list(combn(sort(expertise), 2, FUN = paste, collapse = '_'))) %>%
unnest(expertise) %>%
group_by(expertise) %>%
summarize(count = n()) %>%
separate(expertise, c('expertise1', 'expertise2'), sep = '_')
# # A tibble: 3 x 3
# expertise1 expertise2 count
# <chr> <chr> <int>
# 1 julia python 3
# 2 julia r 2
# 3 python r 2
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.