Speeding up simple R code (vectorization?)

Question

I have two vectors of positive integers specifying the start and end "positions" of ranges:

starts <- sample(10^6, replace = TRUE)
ends <- starts + sample(100:1000, length(starts), replace = TRUE)

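So these specify 1,000,000 ranges that are 100 to 1000 units long. Now I want to know how many times each position (a positive integer) is "covered" by a range. To do that, I do: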

coverage <- integer(max(ends))
for (i in seq_along(starts)) {
      coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1 
}

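But because of the for loop, this is relatively slow. For billions of ranges, it could take a very long time. I can't find a way to vectorize this code. I could split the work and use multiple CPUs, but the speed gain would be marginal. apply, lapply and other meta-functions do not improve the speed (as expected). For example,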

coverage <- tabulate(unlist(Map(':', starts,ends)))

is also slow because of the Map part. I'm afraid it also needs more memory.

Any ideas?

Answer 1

You can keep a count of the ranges that start and end at each particular index, and then apply a cumulative sum to the differences.

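Aggregate the number of ranges that start at each index.
Aggregate the number of ranges that end one position before each index (because ends are inclusive).
Calculate the net change: count of starts - count of ends.
Loop over the indexes and take the cumulative sum of the net changes. This gives the number of ranges that have started at or before this index and have not ended before it.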

The "coverage" number at each index equals this cumulative sum.
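To make the steps concrete, here is a minimal sketch on a tiny made-up example. The three ranges are illustrative only (not from the question), and it uses ordinary vectors with base R `tabulate` rather than sparse vectors:

```r
# Three hypothetical inclusive ranges: 2..4, 3..6 and 3..3
starts <- c(2L, 3L, 3L)
ends   <- c(4L, 6L, 3L)
n <- max(ends)

# Step 1: number of ranges starting at each index (one extra slot at n + 1)
n.starts <- tabulate(starts, nbins = n + 1L)
# Step 2: number of ranges that ended one position before each index
n.ends <- tabulate(ends + 1L, nbins = n + 1L)
# Step 3: net change at each index
net <- n.starts - n.ends
# Step 4: running total = number of ranges covering each index
coverage <- cumsum(net)[seq_len(n)]
print(coverage)
# [1] 0 1 3 2 1 1
```

Position 3 is covered by all three ranges, and position 4 only by the first two, which matches the running total.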

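I tried using sparse vectors to reduce memory usage. It may well be faster with ordinary vectors, I'm not sure. With sparseVector it was 5.7 times faster than the loop approach for the given example.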

library(Matrix)

set.seed(123)

starts <- sample(10^6, replace = TRUE)
ends <- starts + sample(100:1000, length(starts), replace = TRUE)

v.cov <- NULL
fun1 <- function() {
  coverage <- integer(max(ends))
  for (i in seq_along(starts)) {
    coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1 
  }
  v.cov <<- coverage
}
# Testing "for loop" approach
system.time(fun1())
# user  system elapsed 
# 21.84    0.00   21.83 

v.sum <- NULL
fun2 <- function() {      
  # 1. Aggregate the number of ranges that start at each index
  t.starts <- table(starts)
  i.starts <- strtoi(names(t.starts))
  x.starts <- as.vector(t.starts)
  sv.starts <- sparseVector(x=x.starts, i=i.starts, length=max(ends)+1)  # to match length of sv.ends below
  # 2. Aggregate the number of ranges that end at one position before each index
  t.ends <- table(ends)
  i.ends <- strtoi(names(t.ends))+1  # because "ends" are inclusive 
  x.ends <- as.vector(t.ends)
  sv.ends <- sparseVector(x=x.ends, i=i.ends, length=max(ends)+1)

  sv.diff <- sv.starts - sv.ends
  v.sum <<- cumsum(sv.diff)[1:max(ends)]  # drop last element
}
# Testing "cumulative sum" approach
system.time(fun2())
# user  system elapsed 
# 3.828   0.000   3.823

identical(v.cov, v.sum)
# TRUE

Also, there may be a better way to extract x and i for the sparseVector constructor than using table and strtoi(names(x)), which could improve the speed further.
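One possible alternative, offered only as an untested-for-speed sketch: `rle(sort(...))` in base R returns the distinct positions and their counts directly as integers, so no conversion from names is needed. The vector `pos` below is a hypothetical stand-in for `starts` or `ends`:

```r
# Hypothetical small input; in the answer this would be `starts` or `ends`
pos <- c(5L, 2L, 5L, 9L, 2L, 5L)

# Sorting groups equal positions together; rle() then reports each
# distinct position (values) and how often it occurs (lengths)
r <- rle(sort(pos))
i.pos <- r$values   # integer indices, no string parsing required
x.pos <- r$lengths  # count of ranges at each index

print(i.pos)
# [1] 2 5 9
print(x.pos)
# [1] 2 3 1
```

`i.pos` and `x.pos` could then be passed as the `i` and `x` arguments of `sparseVector`. Whether this beats `table` plus `strtoi` on the full data set has not been benchmarked here.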

Edit

Avoiding strtoi by using a 1-column sparseMatrix

v.sum.mat <- NULL
fun3 <- function() {
  v.ones <- rep(1, length(starts))
  m.starts <- sparseMatrix(i=starts, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
  m.ends <- sparseMatrix(i=ends+1, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
  m.diff <- m.starts - m.ends
  v.sum.mat <<- cumsum(m.diff[,1])[1:max(ends)]
}
# Testing "cumulative sum" approach using matrix
system.time(fun3())
#   user  system elapsed 
#  0.456   0.028   0.486 

identical(v.cov, v.sum.mat)
# TRUE
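The reason no explicit counting step is needed in fun3 is that `sparseMatrix` sums the `x` values of repeated (i, j) pairs by default. A small sketch with made-up positions (the names `idx` and `ones` are illustrative):

```r
library(Matrix)

# Hypothetical positions: index 2 appears twice, index 5 once
idx <- c(2, 2, 5)
ones <- rep(1, length(idx))

# Repeated (i, j) pairs have their x values added together by default,
# so each row ends up holding the count of positions equal to that row
m <- sparseMatrix(i = idx, j = ones, x = ones, dims = c(6, 1))
print(m[, 1])
# [1] 0 2 0 0 1 0
```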

Edit 2 - super fast, super short

Based on a comment from @alexis_laz, thanks!

fun4 <- function() {
  cumsum(tabulate(starts, max(ends) + 1L) - tabulate(ends + 1L, max(ends) + 1L))[1:max(ends)]
}
system.time(v.sum.tab <- fun4())
# user  system elapsed 
# 0.040   0.000   0.041 

identical(as.integer(v.cov), v.sum.tab)
# TRUE
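For reuse, the tabulate approach can be wrapped in a function that takes the ranges as arguments and is checked against the explicit loop on small random input. This is a sketch: the names `coverage_cumsum` and `coverage_loop` are hypothetical, and the input checks assume positive integer starts with `ends >= starts`:

```r
# Hypothetical helper wrapping the tabulate-based approach
coverage_cumsum <- function(starts, ends) {
  stopifnot(length(starts) == length(ends),
            all(starts >= 1L), all(ends >= starts))
  n <- max(ends)
  # Net change per index: ranges opening minus ranges that just closed
  net <- tabulate(starts, n + 1L) - tabulate(ends + 1L, n + 1L)
  cumsum(net)[seq_len(n)]
}

# Reference implementation: the explicit loop, kept in integer arithmetic
coverage_loop <- function(starts, ends) {
  coverage <- integer(max(ends))
  for (i in seq_along(starts)) {
    idx <- starts[i]:ends[i]
    coverage[idx] <- coverage[idx] + 1L
  }
  coverage
}

# Compare both on a small random set of ranges
set.seed(1)
s <- sample(50L, 200L, replace = TRUE)
e <- s + sample(0:10, 200L, replace = TRUE)
stopifnot(identical(coverage_cumsum(s, e), coverage_loop(s, e)))
```

Because the reference loop adds `1L` rather than `1`, both results are integer vectors and `identical` can compare them without `as.integer`.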

Solution 1
3, accepted, 2016-05-05 11:49:57
