简体   繁体   English

如何在R的两列中找到多个值的组合(而不是排列)?

[英]How to find a number of combinations (not permutations) of values in two columns in R?

I have a dataset, where each row represents a project. 我有一个数据集,其中每一行代表一个项目。 Several people may work on every project (between 2 and 6). 每个项目可能需要几个人(介于2到6之间)。

The column names are id, person_1, person_2, person_3, person_4, person_5, person_6 列名是id,person_1,person_2,person_3,person_4,person_5,person_6

If less than 6 people working on a project, only first n columns are filled, rest is NA. 如果在一个项目上工作的人数少于6人,则仅填充前n列,其余为NA。

I need to calculate how many times each pair of individuals worked together, what projects they worked together on, and for each pair of people, A and B what is the share of A's total projects were with B and vice-a-versa. 我需要计算每对个人一起工作了多少次,他们一起从事哪些项目,以及A和B对每对人,A与B的总项目份额是多少,反之亦然。

The issues I am stuck with is that: 1) if I have person_1 A and person_2 B and person_1 B and person_2 B it counts it as two different pairs while I need it to understand that it's the same pair. 我遇到的问题是:1)如果我有person_1 A和person_2 B以及person_1 B和person_2 B,则将其视为两个不同的对,而我需要它来理解它是同一对。 2) Since most projects only involve 2 people, I tried the pairs with only 2 columns, but I need to incorporate columns person_3 - person_6. 2)由于大多数项目只涉及2个人,因此我尝试了只有2列的配对,但是我需要合并person_3-person_6列。

I also have this database in a different format, where each row represents one person working on one project. 我还有一个不同格式的数据库,其中每一行代表一个人在一个项目上工作。 In that case, I have all people in the same column, so my columns are project_id and name. 在那种情况下,我所有的人都在同一列中,所以我的列是project_id和name。

EDIT: example below 编辑:下面的例子

Data <- data.frame(
    id = c(1:4),
    person_1 = c("John", "Dan", "Peter", "James"),
person_2 = c("Dan", "John", "Kate", "Lisa"),
person_3 = c(NA, NA, "Kate", NA),
person_4 = c(NA, NA, "Peter", NA),
person_5 = c(NA, NA, NA, NA),
person_6 = c(NA, NA, NA, NA)
)

EDIT2: In my actual dataset I have 667 unique people and 2442 projects. 编辑2:在我的实际数据集中,我有667个独特的人员和2442个项目。

This is quite straight forward, using matrices, specifically an incidence and adjacency matrix. 使用矩阵,特别是入射和邻接矩阵,这非常简单。 I will note, that this will work better with your 2 column dataset format. 我会注意到,这将更适合您的2列数据集格式。

1) Convert the table to an incidence matrix (1 project per row and 1 person per column), and tabulate the projects each person are assigned. 1)将表转换为关联矩阵(每行1个项目,每列1个人),并列出每个人分配的项目。

data_long <- reshape2::melt(Data, id.vars="proj_id", na.rm=TRUE)
proj_mat <- as.data.frame.table(table(data_long[-2]))
proj_mat <- reshape2::dcast(proj_mat, proj_id ~ value, value.var = "Freq")

#  proj_id Dan James John Kate Lisa Peter
#1      P1   1     0    1    0    0     0
#2      P2   1     0    1    0    0     0
#3      P3   0     0    0    2    0     2
#4      P4   0     1    0    0    1     0

2) To know how many projects each pair of people worked on together you can simply calculate the jaccard index. 2)要知道每对人一起从事多少项目,您可以简单地计算jaccard指数。

proxy::dist(t(proj_mat[-1]), method = "jaccard")

#      Dan James John Kate Lisa
#James   1                     
#John    0     1               
#Kate    1     1    1          
#Lisa    1     0    1    1     
#Peter   1     1    1    0    1

3) To get the proportion of shared projects we calculate the adjacency matrix and then divide by it's diagonal 3)为了获得共享项目的比例,我们计算邻接矩阵,然后除以对角线

proj_adj <- t(proj_mat[-1]) %*% as.matrix((proj_mat[-1]))
prop_together <- proj_adj/diag(proj_adj)
prop_together
#      Dan James John Kate Lisa Peter
#Dan     1     0    1    0    0     0
#James   0     1    0    0    1     0
#John    1     0    1    0    0     0
#Kate    0     0    0    1    0     1
#Lisa    0     1    0    0    1     0
#Peter   0     0    0    1    0     1

Two clarifications: 有两个澄清:
1) Kate and Peter are assigned to the same project twice. 1)凯特和彼得两次被分配到同一项目。 Are these 4 different people? 这是4个人吗? Do you expect them to have 2 projects in common or 1? 您希望他们有2个共同的项目还是1个?

2) I don't follow your first issue. 2)我不关注你的第一期。 Do "A" and "B" refer to projects or their position in your initial table? “ A”和“ B”是指项目还是它们在初始表中的位置? Can you point out an example in your sample data or add an example to your sample data. 您能否在示例数据中指出一个示例,还是将示例添加到示例数据中。

if I have person_1 A and person_2 B and person_1 B and person_2 B it 如果我有person_1 A和person_2 B以及person_1 B和person_2 B
counts it as two different pairs while I need it to understand that 将其视为两个不同的对,而我需要它来理解
it's the same pair. 这是同一对。

getCombs <- function(..., data = dat[,-1]){
    names <- list(...)
    pers1InProj <- vapply(names, function (name)  
                                 {rowSums(data == name, na.rm = T) > 0}, 
                                  numeric(nrow(data)))


    return( rowSums(pers1InProj) == length(names)) 
}


An example: 一个例子:

> Data
  id person_1 person_2 person_3 person_4 person_5 person_6
1  1     John      Dan     <NA>     <NA>       NA       NA
2  2      Dan     John     <NA>     <NA>       NA       NA
3  3    Peter     Kate     Kate    Peter       NA       NA
4  4    James     Lisa     <NA>     <NA>       NA       NA
> which(getCombs("John", "Dan", data = Data))
 [1] 1 2
> which(getCombs("Peter", "John", "Dan", data = Data))
 integer(0)

which(getCombs("Peter", "Kate", data = Data))
 [1] 3

Code for all pairs: 所有对的代码:

persons <- c("Peter", "John", "Dan", "James", "Kate", "Lisa")

tmp <- combn(persons, 2); #create all pairs of two elements from persons
tmp
     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]   [,7]    [,8]   [,9]  
[1,] "Peter" "Peter" "Peter" "Peter" "Peter" "John" "John"  "John" "John"
[2,] "John"  "Dan"   "James" "Kate"  "Lisa"  "Dan"  "James" "Kate" "Lisa"
     [,10]   [,11]  [,12]  [,13]   [,14]   [,15] 
[1,] "Dan"   "Dan"  "Dan"  "James" "James" "Kate"
[2,] "James" "Kate" "Lisa" "Kate"  "Lisa"  "Lisa"

# apply getCombs to every column fo tmp:
res <- apply(tmp, 2, function(pair) getCombs(pair[1], pair[2], data = Data))

# set colnames(res) for simplicity:
colnames(res) <- paste(tmp[1,], tmp[2,], sep="_")

# for example the sixth pair is c("John", "Dan"), they work together on 
 #project1 and project2 therefore res[1,6] == res[2,6]==TRUE:

> res
     Peter_John Peter_Dan Peter_James Peter_Kate Peter_Lisa John_Dan John_James John_Kate
[1,]      FALSE     FALSE       FALSE      FALSE      FALSE     TRUE      FALSE     FALSE
[2,]      FALSE     FALSE       FALSE      FALSE      FALSE     TRUE      FALSE     FALSE
[3,]      FALSE     FALSE       FALSE       TRUE      FALSE    FALSE      FALSE     FALSE
[4,]      FALSE     FALSE       FALSE      FALSE      FALSE    FALSE      FALSE     FALSE
     John_Lisa Dan_James Dan_Kate Dan_Lisa James_Kate James_Lisa Kate_Lisa
[1,]     FALSE     FALSE    FALSE    FALSE      FALSE      FALSE     FALSE
[2,]     FALSE     FALSE    FALSE    FALSE      FALSE      FALSE     FALSE
[3,]     FALSE     FALSE    FALSE    FALSE      FALSE      FALSE     FALSE
[4,]     FALSE     FALSE    FALSE    FALSE      FALSE       TRUE     FALSE
> 
# number of times people worked together: 
 colSums(res)
 Peter_John   Peter_Dan Peter_James  Peter_Kate  Peter_Lisa    John_Dan  John_James 
          0           0           0           1           0           2           0 
  John_Kate   John_Lisa   Dan_James    Dan_Kate    Dan_Lisa  James_Kate  James_Lisa 
          0           0           0           0           0           0           1 
  Kate_Lisa 
          0 

Solution if res doesnt fit into menory: 如果res不能满足要求,请采取以下解决方案:

count <- integer(ncol(tmp))
names(count) <- colnames(res)
for(i in 1:ncol(tmp)) count[i] <- sum(getCombs(tmp[1,i], tmp[2,i], data = Data))

> count
 Peter_John   Peter_Dan Peter_James  Peter_Kate  Peter_Lisa    John_Dan  John_James 
          0           0           0           1           0           2           0 
  John_Kate   John_Lisa   Dan_James    Dan_Kate    Dan_Lisa  James_Kate  James_Lisa 
          0           0           0           0           0           0           1 
  Kate_Lisa 
          0 

 all.equal(count, colSums(res))
[1] TRUE

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM