How to vectorize or otherwise speed-up this looping logic in R?

Long-time lurker, first-time asker.

I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ item dataset. Sample data looks like this:

#serially numbered items
parents <- rep(1:10000)

#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))

#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
  if (numchild[x]>0){
    f1 <- sample(1:length(parents), size=numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  }
  else {
    parent_child <- c(parent_child, list(x+1))    #if numchild=0, make up something
  }
}

Here is what I want to do: say parent item #1 has 5 child items -- 1, 2, 3, 4, 5 -- and parent item #2 has 3 child items -- 4, 10, 22.

I want to compute the length(intersection) of every (parent_i, parent_j) combination. In the above case, it would be 1 common item -- 4.
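
Restated in base R, that toy check is just the intersect() call the loop further down relies on:

> length(intersect(c(1, 2, 3, 4, 5), c(4, 10, 22)))
[1] 1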

I am doing this for 10M+ parent items that on average have 15-20 child items each, with counts in the (0, 100) range. So that's a 10M x 10M item-item matrix -- on the order of choose(10^7, 2), i.e. about 5 x 10^13 candidate pairs.

I have a foreach loop that I am testing on a smaller subset; it works but doesn't quite scale to the full dataset (64-core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix, since (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.

library(foreach)
library(doParallel)
registerDoParallel()    #register a parallel backend for %dopar%

#small subset
a <- parent_child[1:1000]

outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[i+1:length(a)]    #note: parses as a[i + (1:length(a))], not a[(i+1):length(a)]

  foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
  }  
}

I've been experimenting with variations on this (using Reduce, storing parent-children in a dataframe, etc.) but haven't had much luck.

Is there a way to make this scale?

I reversed the split, so that we have a child-parent relationship

len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len), 
                      unlist(parent_child, use.names=FALSE))
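
On a toy parent_child list (a small illustration of my own, not the original data), the inversion looks like this:

> pc <- list(c(1, 2, 3), c(2, 3))    #children of parent 1 and parent 2
> split(rep(seq_along(pc), sapply(pc, length)), unlist(pc, use.names=FALSE))
$`1`
[1] 1

$`2`
[1] 1 2

$`3`
[1] 1 2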

Something like the following constructs, for every child shared by more than one parent, strings identifying each pair of parents sharing that child:

keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
    x <- combn(sort(x), 2)
    paste(x[1,], x[2,], sep=".")
})

and tallying (each name is a 'parent_i.parent_j' pair; the value is the number of children that pair shares)

table(unlist(int, use.names=FALSE))

or a little more quickly

xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
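
As a quick toy check (values of my own), this match()/tabulate() idiom gives the same counts as table():

> xx <- c("a.b", "a.c", "a.b")
> nms <- unique(xx)
> setNames(tabulate(match(xx, nms), length(nms)), nms)
a.b a.c 
  2   1 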

for

f1 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

with (this is for all 10000 parent-child elements)

> system.time(ans1 <- f1(parent_child))
   user  system elapsed 
 14.625   0.012  14.668 
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806 
       1        1        1        1        1        1 

I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial (choose(k, 2) pairs for a child shared by k parents, so quadratic) in the number of parents per child.

One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as the 'key' and storing the combination as the 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent.

combn1 <- local({
    memo <- new.env(parent=emptyenv())
    function(x) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
    }
})
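
For example, on a toy vector of parent ids (my own illustration):

> combn1(c(3, 7, 9))
[1] "3.7" "3.9" "7.9"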

f2 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn1)

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

which helps somewhat

>     system.time(ans2 <- f2(parent_child))
   user  system elapsed 
  5.337   0.000   5.347 
>     identical(ans1, ans2)
[1] TRUE

The slow part is now paste

>     Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
                 self.time self.pct total.time total.pct
"paste"               3.92    73.41       3.92     73.41
"match"               0.74    13.86       0.74     13.86
"unique.default"      0.40     7.49       0.40      7.49
"as.character"        0.08     1.50       0.08      1.50
"unlist"              0.08     1.50       0.08      1.50
"combn"               0.06     1.12       0.06      1.12
"lapply"              0.02     0.37       4.00     74.91
"any"                 0.02     0.37       0.02      0.37
"setNames"            0.02     0.37       0.02      0.37

$by.total
...

We can avoid this by encoding each pair of parents sharing a child into a single number; because of the way floating point numbers are represented in R, this will be exact until about 2^21.

encode <- function(x, y, n)
    (x - 1) * (n + 1) + y
decode <- function(z, n)
    list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
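
A quick round-trip check with toy values (my own, using n = 10 parents):

> z <- encode(3, 7, 10)
> z
[1] 29
> decode(z, 10)
$x
[1] 3

$y
[1] 7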

Adjusting combn1 and f2 accordingly (as combn2 and f3):

combn2 <- local({
    memo <- new.env(parent=emptyenv())
    function(x, encode_n) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
    }
})

f3 <- function(parent_child) {
    encode_n <- length(parent_child)
    len <- sapply(parent_child, length)
    child_parent <-
        unname(split(rep(seq_along(parent_child), len), 
                     unlist(parent_child, use.names=FALSE)))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn2, encode_n)

    id <- unlist(int, use.names=FALSE)
    uid <- unique(id)
    n <- tabulate(match(id, uid), length(uid))
    do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}

leading to

> system.time(f3(parent_child))
   user  system elapsed 
  2.140   0.000   2.146 

This compares very favorably (note that the timing in the previous line is for 10,000 parent-child relations) with jlhoward's revised answer

> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
   user  system elapsed 
  2.465   0.000   2.468
> system.time(f3(parent_child[1:99]))
   user  system elapsed 
  0.016   0.000   0.014 

and scales in a much more reasonable way.

For what it's worth, the data generation routine falls into the second circle of Patrick Burns's R Inferno, using the 'copy-and-append' algorithm rather than pre-allocating the space and filling it in. Avoid this by writing the for loop body as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue beforehand

numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))

or by sampling from a distribution (rpois, rbinom) that generates non-negative integer counts. Data generation is then

n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
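
If one preferred the distribution route mentioned above, a minimal sketch (my own variant, not from the post; pmax() clamps to at least one child so the zero-children case never arises) would be:

n_parents <- 10000
numchild <- pmax(1L, rpois(n_parents, lambda=30))
parent_child <- lapply(numchild, sample, x=n_parents)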

Here is another approach that is about 10X faster than my previous answer, and 17X faster than the original code (also simpler):

ff <- function(u2, u1, a) {
  #count the children shared by parent u1 (children passed in as 'a') and parent u2
  common <- length(intersect(a, parent_child[[u2]]))
  if (common > 0) {return(data.frame(u1, u2, common))}
}

gg <- function(u1) {
  #compare parent u1 against all later parents (hard-coded here to the first 100 parents)
  a <- parent_child[[u1]]
  do.call("rbind", lapply((u1+1):100, ff, u1, a))
}

system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
   user  system elapsed 
   1.04    0.00    1.03 

result.3 is identical to result.2 from the previous answer:

max(abs(result.3-result.2))
[1] 0

Well, a small improvement (I think):

Original code (wrapped in a function call):

f = function(n) {
  #small subset
  a <- parent_child[1:n]

  outerresults <- foreach (i = 1:(length(a)), 
                           .combine=rbind,
                           .packages=c('foreach','doParallel')) %dopar% {
    b <- a[[i]]
    rest <- a[i+1:length(a)]

    foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
      common <- length(intersect(b, rest[[j]]))
      if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
    }  
  }  
  return(outerresults)
}

Modified code:

g <- function(n) {
  a <- parent_child[1:n]

  outerresults <- foreach (i = 1:n, 
                           .combine=rbind, 
                           .packages=c('foreach','doParallel')) %dopar% {
    b <- a[[i]]

    foreach (j = (i):n, .combine=rbind) %dopar% {
      if (i!=j) {
        c <- a[[j]]
        common <- length(intersect(b, c))
        if (common > 0) {g <- data.frame(u1=i, u2=j, common)}
      }
    }  
  }
  return(outerresults)
}

Benchmarks:

system.time(result.old<-f(100))
   user  system elapsed 
  17.21    0.00   17.33 
system.time(result.new<-g(100))
   user  system elapsed 
  10.42    0.00   10.47 

The numbering for u2 is a little different because of the different approaches, but both produce the same vector of matches:

max(abs(result.old$common-result.new$common))
[1] 0

I tried this with data.table joins replacing intersect(...) and it was actually much slower(!!).
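
For reference, here is a sketch of what such a data.table self-join might look like (my own reconstruction, assuming a long one-row-per-(parent, child) table; this is not the code that was actually timed):

library(data.table)

#one row per (parent, child) relation
dt <- data.table(parent = rep(seq_along(parent_child), sapply(parent_child, length)),
                 child  = unlist(parent_child, use.names=FALSE))

#self-join on child, keep each unordered pair once, count shared children
pairs  <- merge(dt, dt, by="child", allow.cartesian=TRUE)[parent.x < parent.y]
result <- pairs[, .(common = .N), by=.(u1 = parent.x, u2 = parent.y)]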
