简体   繁体   中英

R - joining more than 2^31 rows with data.table

I have an igraph .network graph with 103,887 nodes and 4,795,466 ties.

This can be structured as an edgelist in a data.table with almost 9 million rows.

I can find the common neighbors in this.network, following @chinsoon12's answer here . See the example below.

This works beautifully for smaller.networks, but runs into problems in my use-case because the merge results in more than 2^31 rows .

Questions:

  • Are there efficient alternatives on how to deal with this?
  • Can I split the data and do the computation in steps? The results will be used to query about common neighbors.

Example - modified from @chinsoon12's answer:

library(data.table)
library(igraph)

set.seed(1234)
g <- random.graph.game(10, p=0.10)

adjSM <- as(get.adjacency(g), "dgTMatrix")
adjDT <- data.table(V1=adjSM@i+1, V2=adjSM@j+1)

res <- adjDT[adjDT, nomatch=0, on="V2", allow.cartesian=TRUE
][V1 < i.V1, .(Neighbours=paste(V2, collapse=",")),
  by=c("V1","i.V1")][order(V1)]

res

   V1 i.V1 Neighbours
1:  4    5          8
2:  4   10          8
3:  5   10          8

common neighbors

Can I split the data and do the computation in steps?

You can split by V1 to avoid running into the big-merge issue:

neighDT = adjDT[, if (.N > 1) {
    cb = combn(V2, 2)
    .(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]

which gives

   neighbor a  b
1:        8 4  5
2:        8 4 10
3:        8 5 10

(The OP found gRbase::combnPrim to be faster than combn here.)

How can we collapse all the common neighbors (separated with a comma) for the same combination into one observation?

neighDT_agg = neighDT[order(neighbor), 
  .(neighbors = toString(neighbor))
, keyby=.(a,b)]

The order ensures that the string is sorted alphabetically. The keyby ensures that the table is sorted by pairs {a,b} and facilitates a simple fast lookup for multiple pairs at once:

# single query
neighDT_agg[.(4,10), neighbors]
# [1] "8"

# multi query
pairs_queryDT = data.table(a = c(4,5,8), b = c(5,10,10))
neighDT_agg[pairs_queryDT, neighbors]
[1] "8" "8" NA

I have an igraph.network graph with 103,887 nodes and 4,795,466 ties.

Each call to combn will be making a 2-by- choose(.N, 2) matrix. If a node is connected to all other nodes, then it is a common neighbor to all pairs of other nodes and you'll be facing choose(103887-1, 2) of these pairs. I guess this is more an issue with the way the problem is defined than with the approach to solving it.


The results will be used to query about common neighbors.

For the approach above, you'll need to compute the full neighbors table first.

If you just have a few ad hoc queries about intersecting neighbors:

find_neighbors <- function(a, b){
    adjDT[.(c(a, b)), on=.(V1), V2[duplicated(V2)]]
}

find_neighbors(4, 10)
# [1] 8

This can similarly be wrapped in toString to collapse the values.

Update

  • If you just want to query the common neighbors, I don't suggest you build up a huge look-up table. Instead, you can use the following code to get the result for your query:
find_common_neighbors <- function(g, Vs) {
  which(colSums(distances(g, Vs) == 1) == length(Vs))
}

such that

> find_common_neighbors(g, c(4, 8))
integer(0)

> find_common_neighbors(g, c(4, 5))
[1] 8
  • If you need a look-up table, an alternative is to use Neighbours as the key to search its associated node, eg,
res <- transform(
  data.frame(Neighbours = which(degree(g) >= 2)),
  Nodes = sapply(
    Neighbours,
    function(x) toString(neighbors(g, x))
  )
)

Previous Answer

I think you can use ego over g directly to generate res , eg,

setNames(
  data.frame(
    t(do.call(
      cbind,
      lapply(
        Filter(function(x) length(x) > 2, ego(g, 1)),
        function(x) {
          rbind(combn(x[-1], 2), x[1])
        }
      )
    ))
  ),
  c("V1", "V2", "Neighbours")
)

which gives

  V1 V2 Neighbours
1  4  5          8
2  4 10          8
3  5 10          8

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM