How can I vectorize or otherwise speed up this looping logic in R?

Question

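Long-time lurker, first-time asker.

I'm trying to compute the "items in common between two sets of items" for a dataset of 20M+ items. Sample data is shown below.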

#serially numbered items
parents <- rep(1:10000)

#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))

#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
  if (numchild[x]>0){
    f1 <- sample(1:length(parents), size=numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  }
  else {
    parent_child <- c(parent_child, list(x+1))    #if numchild=0, make up something
  }
}

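Here is what I'm trying to do: say parent #1 has 5 children (1, 2, 3, 4, 5) and parent #2 has 3 children (4, 10, 22).

I want to compute length(intersect) for every (parent_i, parent_j) combination. In the example above that would be 1 item in common: 4.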
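For concreteness, the single-pair computation described above can be sketched like this. The names `p1`, `p2`, and `n_common` are only illustrative and do not appear in the original code.

```r
# Toy data taken from the example above: children of parent 1 and parent 2
p1 <- c(1, 2, 3, 4, 5)
p2 <- c(4, 10, 22)

# Number of children the two parents have in common
n_common <- length(intersect(p1, p2))
n_common
```

The full problem repeats this for every pair of parents, which is where the cost comes from.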

I'm doing this for 10M+ parent items, each with 15-20 child items on average, in a range of (0, 100). That is a 10M x 10M item matrix.
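A quick back-of-the-envelope calculation shows why the full matrix is out of reach. This is only a sketch: it assumes a dense layout with one 4-byte integer per unordered pair, which is an assumption made here and not something stated in the question.

```r
n_parents <- 1e7

# Number of unordered pairs (i < j): n * (n - 1) / 2
n_pairs <- n_parents * (n_parents - 1) / 2
n_pairs

# Storage in terabytes (1 TB = 1e12 bytes), assuming 4 bytes per pair
tb_needed <- n_pairs * 4 / 1e12
tb_needed
```

Since most pairs share no children at all, any workable approach has to enumerate only the pairs that actually overlap.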

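I have a foreach loop that I'm testing on a smaller subset, but I can't quite scale it to the full dataset (64-core machine with 256 GB RAM). In the loop below I already compute only half of the user-user matrix, since (parent_i, parent_j) and (parent_j, parent_i) are the same.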

#small subset
a <- parent_child[1:1000]

outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[i+1:length(a)]

  foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
  }  
}

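I've been experimenting with variations on this (using Reduce, storing the parent-children in a data frame, etc.), but without much luck.

Is there a way to make this work at this scale?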

Answer 1

I inverted the split, so that we have a child-to-parent relationship

len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len), 
                      unlist(parent_child, use.names=FALSE))
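To see what the inversion produces, here is a small sketch on made-up data. The list `toy` and the name `toy_child_parent` are hypothetical; `lengths()` is used as a shorthand for `sapply(x, length)`.

```r
# Hypothetical toy input: 3 parents and their children
toy <- list(c(1, 2, 3, 4, 5), c(4, 10, 22), c(2, 4))

len <- lengths(toy)

# Repeat each parent index once per child, then group the indices by child id
toy_child_parent <- split(rep(seq_along(toy), len),
                          unlist(toy, use.names = FALSE))

toy_child_parent[["4"]]   # parents that have child 4
toy_child_parent[["2"]]   # parents that have child 2
```

Every pair of parents that appears together in one of these groups shares that child, so only those pairs ever need to be counted.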

Something like the following builds a string for each pair of parents that share a child

keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
    x <- combn(sort(x), 2)
    paste(x[1,], x[2,], sep=".")
})

and tally

table(unlist(int, use.names=FALSE))

or, a little faster,

xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
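The tallying step can be checked on a handful of made-up keys. This is a minimal sketch; the vector `xx` below is invented for illustration, and `counts` and `tab` are names introduced here.

```r
# Hypothetical pair keys of the form "parent_i.parent_j"
xx <- c("1.2", "1.3", "1.2", "2.3", "1.2")

nms <- unique(xx)              # distinct keys, in order of first appearance
cnt <- match(xx, nms)          # integer position of each key within nms
counts <- setNames(tabulate(cnt, length(nms)), nms)
counts

# table() gives the same counts, but with its names sorted
tab <- table(xx)
```

The two results differ only in ordering: `table()` sorts its names, while the `match()`/`tabulate()` version keeps keys in order of first appearance.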

Wrapped in a function:

f1 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

the timing is (this is for all 10,000 parent-child elements)

> system.time(ans1 <- f1(parent_child))
   user  system elapsed 
 14.625   0.012  14.668 
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806 
       1        1        1        1        1        1

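I'm not sure that this will really scale to the size of problem you're talking about, though; it is polynomial in the number of parents per child.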
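The growth can be made concrete: a child with k parents contributes choose(k, 2) = k(k - 1)/2 parent pairs. The sketch below uses invented parent counts, and `parents_per_child` and `pairs_per_child` are hypothetical names.

```r
# Hypothetical number of parents for each of five children
parents_per_child <- c(1, 2, 3, 30, 60)

# Each child with k parents generates choose(k, 2) parent pairs
pairs_per_child <- choose(parents_per_child, 2)
pairs_per_child

# Total number of pair keys that would be generated and tallied
sum(pairs_per_child)
```

Doubling the number of parents per child from 30 to 60 roughly quadruples the number of pairs, so a few very popular children can dominate the run time.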

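One possibility for a speed-up is to "memoize" the combination calculation, using the length of the argument as the "key" and storing the combinations as the "value". This reduces the number of calls to combn to the number of unique lengths among the elements of child_parent.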

combn1 <- local({
    memo <- new.env(parent=emptyenv())
    function(x) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
    }
})

f2 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn1)

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

This helps a bit

>     system.time(ans2 <- f2(parent_child))
   user  system elapsed 
  5.337   0.000   5.347 
>     identical(ans1, ans2)
[1] TRUE

The slow part is now paste

>     Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
                 self.time self.pct total.time total.pct
"paste"               3.92    73.41       3.92     73.41
"match"               0.74    13.86       0.74     13.86
"unique.default"      0.40     7.49       0.40      7.49
"as.character"        0.08     1.50       0.08      1.50
"unlist"              0.08     1.50       0.08      1.50
"combn"               0.06     1.12       0.06      1.12
"lapply"              0.02     0.37       4.00     74.91
"any"                 0.02     0.37       0.02      0.37
"setNames"            0.02     0.37       0.02      0.37

$by.total
...

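We can avoid this by encoding each pair of parents that share a child as a single integer; because of the way floating-point numbers are represented in R, this will be exact until about 2^21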

encode <- function(x, y, n)
    (x - 1) * (n + 1) + y
decode <- function(z, n)
    list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
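A round trip through the two helpers shows that the encoding is reversible. This is only a sketch: the parent ids below are arbitrary, and the final check against 2^53 (the point up to which a double represents every integer exactly) is an extra safeguard added here, not part of the original answer.

```r
encode <- function(x, y, n)
    (x - 1) * (n + 1) + y
decode <- function(z, n)
    list(x = ceiling(z / (n + 1)), y = z %% (n + 1))

n <- 10000                                  # number of parents in the example data
z <- encode(c(542, 1, 9999), c(1611, 2, 10000), n)
z

pair <- decode(z, n)                        # recovers the original (x, y) ids
pair

# The largest possible code is about n * (n + 1); keep it below 2^53
code_fits <- n * (n + 1) < 2^53
code_fits
```

Because the codes are plain numbers, `unique()` and `match()` can work on them directly, without building strings.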

and adjusting our combn1 and f2 functions

combn2 <- local({
    memo <- new.env(parent=emptyenv())
    function(x, encode_n) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
    }
})

f3 <- function(parent_child) {
    encode_n <- length(parent_child)
    len <- sapply(parent_child, length)
    child_parent <-
        unname(split(rep(seq_along(parent_child), len), 
                     unlist(parent_child, use.names=FALSE)))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn2, encode_n)

    id <- unlist(int, use.names=FALSE)
    uid <- unique(id)
    n <- tabulate(match(id, uid), length(uid))
    do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}

leading to

> system.time(f3(parent_child))
   user  system elapsed 
  2.140   0.000   2.146

This compares very favorably with jlhoward's revised answer (note that the timing above is for all 10,000 parent-child relations)

> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
   user  system elapsed 
  2.465   0.000   2.468
> system.time(f3(parent_child[1:99]))
   user  system elapsed 
  0.016   0.000   0.014

and it scales in a much more reasonable way.

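For what it's worth, the data generation routine is in the second circle of Patrick Burns' R Inferno: it uses the "copy-and-append" algorithm rather than pre-allocating space and filling it in. Avoid this by writing the body of the for loop as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the problem beforehand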

numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))

or by sampling from a distribution (rpois, rbinom) that generates non-negative integer values. Then generate the data

n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
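The rpois alternative mentioned above might look like the following sketch. The seed, the smaller `n_parents`, and `lambda = 30` are arbitrary choices made here for illustration; they are not from the original post.

```r
set.seed(1)        # arbitrary seed, only so the sketch is reproducible
n_parents <- 100   # kept small for illustration

# Poisson counts are never negative, so no repair step is needed
numchild <- rpois(n_parents, lambda = 30)

# One sample() call per parent; lapply() allocates the result list once
parent_child <- lapply(numchild, sample, x = n_parents)

# Every parent got exactly the requested number of children
all(lengths(parent_child) == numchild)
```

Note that a Poisson draw can still be 0, in which case that parent simply gets an empty set of children.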

Answer 2

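Here is another approach that is about 10 times faster than my previous answer and 17 times faster than the original code (it is also simpler):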

ff <- function(u2, u1, a) {
  common <- length(intersect(a,parent_child[[u2]]))
  if (common>0) {return(data.frame(u1,u2,common))}
}

gg <- function(u1) {
  a <- parent_child[[u1]]
  do.call("rbind",lapply((u1+1):100,ff,u1,a))
}

system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
   user  system elapsed 
   1.04    0.00    1.03

result.3 is identical to result.2 from the previous answer:

max(abs(result.3-result.2))
[1] 0

Answer 3

OK, a slight improvement (I think):

Original code (wrapped in a function call):

f = function(n) {
  #small subset
  a <- parent_child[1:n]

  outerresults <- foreach (i = 1:(length(a)), 
                           .combine=rbind,
                           .packages=c('foreach','doParallel')) %dopar% {
    b <- a[[i]]
    rest <- a[i+1:length(a)]

    foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
      common <- length(intersect(b, rest[[j]]))
      if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
    }  
  }  
  return(outerresults)
}

Modified code:

g <- function(n) {
  a <- parent_child[1:n]

  outerresults <- foreach (i = 1:n, 
                           .combine=rbind, 
                           .packages=c('foreach','doParallel')) %dopar% {
    b <- a[[i]]

    foreach (j = (i):n, .combine=rbind) %dopar% {
      if (i!=j) {
        c <- a[[j]]
        common <- length(intersect(b, c))
        if (common > 0) {g <- data.frame(u1=i, u2=j, common)}
      }
    }  
  }
  return(outerresults)
}

Benchmarks:

system.time(result.old<-f(100))
   user  system elapsed 
  17.21    0.00   17.33 
system.time(result.new<-g(100))
   user  system elapsed 
  10.42    0.00   10.47

Because of the different approach, the numbering of u2 differs slightly, but both produce the same vector of matches:

max(abs(result.old$common-result.new$common))
[1] 0

I tried replacing intersect(...) with a data table join, and it is actually much slower (!!)

Solution 1
6 2013-12-01 06:00:13

Solution 2
2 2013-12-01 15:59:56

Solution 3
1 2013-12-01 02:41:45
