
How do I generate a raw count of how many times a set of individuals is connected to an individual?

Say I have the following datasets:

name1 <- c("John", "Mary", "Anne", "Joe", "David")
name2 <- c("Mary", "John", "Linda", "David", "Joe")

df <- data.frame(name1, name2)

> df
  name1 name2
1  John  Mary
2  Mary  John
3  Anne Linda
4   Joe David
5 David   Joe

name3 <- c("Kate", "Kate", "Kate", "Roger", "Roger", "Patty", "Patty")
name4 <- c("Mary", "John", "Bob", "David", "Joe", "Anne", "Linda")

df2 <- data.frame(name3, name4)

> df2
  name3 name4
1  Kate  Mary
2  Kate  John
3  Kate   Bob
4 Roger David
5 Roger   Joe
6 Patty  Anne
7 Patty Linda

Names are considered a “set” when they are paired with each other in both directions. So “John & Mary” is a pair because there's also “Mary & John”.

I want to look at how many times each pair from df (John & Mary and Joe & David) is connected to the same individual in df2. In this toy example, John and Mary are both connected to Kate, and David and Joe are both connected to Roger. If John and Mary were also both connected to Roger, the pair would share two individuals, so its "No. of times" would be 2.

For the current dfs, I want a table that shows:

Pair              No. of times
John – Mary       1
Joe – David       1

There are some social network packages out there that provide a visual of how these individuals are connected, but I'm just looking for a simple table that shows the counts.

Here's a method that uses the igraph package. First we create a graph from the main data.frame, keeping only the "sets" (those vertices joined by more than one edge). Then we mark those as the ones we are interested in by giving them an edge attribute of "main". We then join those with the rest of the data.

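library(igraph)

# Build an undirected graph from df and keep only the duplicated edges (the mutual pairs)
gg1 <- graph_from_data_frame(df, directed = FALSE)
gg1 <- delete_edges(gg1, which(!which_multiple(gg1)))
E(gg1)$main <- TRUE  # flag the pair edges

# Graph of the links in df2, merged with the pair graph by vertex name
gg2 <- graph_from_data_frame(df2, directed = FALSE)

ggfull <- union(gg1, gg2)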

# (optional) preview results
E(ggfull)$color <- ifelse(!is.na(E(ggfull)$main), "red", "grey")
plot(ggfull)
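To check which pairs the duplicate-edge filter should keep without plotting anything, here is a minimal base R sketch. It is not part of the igraph approach; it assumes a "set" means a row of df whose reversed row also appears in df, and the helper name pair_key is made up for illustration.

```r
# Toy data from the question
df <- data.frame(
  name1 = c("John", "Mary", "Anne", "Joe", "David"),
  name2 = c("Mary", "John", "Linda", "David", "Joe"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: an order-independent label for a pair,
# so ("John", "Mary") and ("Mary", "John") get the same key
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "|")

# A row is mutual when its reversed form also occurs as a row
forward <- paste(df$name1, df$name2, sep = "|")
reverse <- paste(df$name2, df$name1, sep = "|")
is_mutual <- reverse %in% forward

# One key per mutual pair
sets <- unique(pair_key(df$name1[is_mutual], df$name2[is_mutual]))
print(sets)
```

Anne and Linda appear in only one direction, so they are dropped, which matches the edges that survive in gg1.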

Now here's a helper function that will go through the graph and find all the "triangles" where one of the edges is from the "main" set.

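find_main_trios <- function(g) {
  # tricnt[e] counts the triangles that contain "main" edge e
  tricnt <- numeric(gsize(g))
  # triangles() returns a flat vertex sequence, three vertices per triangle
  triset <- triangles(g)
  for(i in seq(1, length(triset), by=3)) {
    # the three edges of the current triangle
    edges <- c(
      E(g)[triset[i]%--%triset[i+1]],
      E(g)[triset[i+1]%--%triset[i+2]],
      E(g)[triset[i]%--%triset[i+2]]
    )
    for (edge in edges)
      if (!is.na(E(g)[edge]$main)) {
        tricnt[edge] <- tricnt[edge] + 1
      }
  }
  # one row per "main" edge that appears in at least one triangle
  do.call("rbind", lapply(which(tricnt>0), function(i) {
    names <- V(g)[inc(i)]$name
    data.frame(name1=names[1], name2=names[2], count=tricnt[i], edgeid=i)
  }))
}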

Most of the work is done by the triangles() function, which finds sets of three nodes that are all connected to each other. We then need to make sure that each triangle contains one of the sets from the first data.frame we are interested in. The last bit of the function just wrangles everything into a data.frame. So when we run it we get

find_main_trios(ggfull)
#   name1 name2 count edgeid
# 1   Joe David     1      5
# 2  John  Mary     1      9

This gives the summary you are after.
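As a cross-check on the triangle count, the same table can be sketched in base R by counting, for each mutual pair, the people in df2 linked to both members. This is an alternative under stated assumptions rather than the igraph method above: links in df2 are treated as undirected, duplicate rows in df2 count once, and the names neighbours and result are made up for illustration.

```r
# Toy data from the question
df <- data.frame(
  name1 = c("John", "Mary", "Anne", "Joe", "David"),
  name2 = c("Mary", "John", "Linda", "David", "Joe"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name3 = c("Kate", "Kate", "Kate", "Roger", "Roger", "Patty", "Patty"),
  name4 = c("Mary", "John", "Bob", "David", "Joe", "Anne", "Linda"),
  stringsAsFactors = FALSE
)

# Keep rows of df whose reversed row also exists, then one row per unordered pair
mutual <- df[paste(df$name2, df$name1, sep = "|") %in%
               paste(df$name1, df$name2, sep = "|"), ]
mutual <- mutual[mutual$name1 < mutual$name2, ]

# Hypothetical helper: everyone linked to x in df2, in either column
neighbours <- function(x, links) {
  unique(c(links$name3[links$name4 == x], links$name4[links$name3 == x]))
}

# Number of people linked to both members of each pair
mutual$count <- mapply(
  function(a, b) length(intersect(neighbours(a, df2), neighbours(b, df2))),
  mutual$name1, mutual$name2,
  USE.NAMES = FALSE
)

# Drop pairs that share nobody and tidy the row names
result <- mutual[mutual$count > 0, ]
rownames(result) <- NULL
print(result)
```

On the toy data this reports a count of 1 for John and Mary (via Kate) and 1 for David and Joe (via Roger). Adding the rows Roger/John and Roger/Mary to df2 would raise the John and Mary count to 2, matching the behaviour described in the question.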
