Speed up simple R code (vectorize?)

I have two positive integer vectors specifying the start and end "positions" of ranges:

starts <- sample(10^6, replace = TRUE)
ends <- starts + sample(100:1000, length(starts), replace = TRUE)

So these specify 1000000 ranges that are 100 to 1000 units long. Now I want to know how many times each position (positive integer) is "covered" by a range. For this I do:

coverage <- integer(max(ends))
for(i in seq(length(starts))) {
      coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1 
}

But because of the for loop, it is relatively slow, and for billions of ranges it can take a very long time. I cannot find a way to vectorize this code. I could split the work across multiple CPUs, but the speed gain would be marginal. apply, lapply and other meta-functions do not improve speed (as expected). For instance,

coverage <- tabulate(unlist(Map(':', starts,ends)))

is also slow because of the "Map" part. I fear it also takes more memory.
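
The memory concern can be estimated without running the Map call: unlist(Map(':', starts, ends)) materialises one integer for every position of every range. The sketch below uses a smaller stand-in data set (10^4 ranges, an assumption made only to keep it quick) and counts how many elements the expanded vector would hold.

```r
# Smaller stand-in for the question's data (10^4 ranges instead of 10^6).
set.seed(1)
starts <- sample(10^4, replace = TRUE)
ends <- starts + sample(100:1000, length(starts), replace = TRUE)

# Number of positions that unlist(Map(':', starts, ends)) would materialise.
# as.numeric() guards against integer overflow for much larger inputs.
n_expanded <- sum(as.numeric(ends - starts) + 1)

# Approximate size of that integer vector in MiB (4 bytes per element).
n_expanded * 4 / 2^20
```

Scaled to the question's 10^6 ranges, with lengths averaging roughly 550 positions, the expanded vector would hold on the order of 5.5e8 integers, which is roughly 2 GB before counting the intermediate list that Map builds.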

Any ideas?

You could keep a count of the ranges that start and end at each index and then take a cumulative sum over the difference of these counts.

  1. Aggregate the number of ranges that start at each index.
  2. Aggregate the number of ranges that end one position before each index (if ends are inclusive).
  3. Calculate the net change: count of starts - count of ends.
  4. Loop over the indexes and sum up the net changes cumulatively. This gives the number of ranges that started at or before this index and have not yet ended at this index.

The "covered" number is equal to this cumulative sum at each index.
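
The four steps can be traced by hand on a tiny input. The sketch below is only an illustration, not the benchmarked code: it uses three made-up inclusive ranges (2..4, 3..6 and 6..6) and base R's tabulate for the counting.

```r
# Three inclusive ranges: 2..4, 3..6 and 6..6 (illustrative values).
starts <- c(2L, 3L, 6L)
ends   <- c(4L, 6L, 6L)
n <- max(ends)

# Step 1: ranges opening at each index (one extra slot for ends + 1).
start_counts <- tabulate(starts, nbins = n + 1L)
# Step 2: ranges that ended one position before each index.
end_counts <- tabulate(ends + 1L, nbins = n + 1L)
# Step 3: net change at each index.
net_change <- start_counts - end_counts
# Step 4: running total, dropping the extra slot.
coverage <- cumsum(net_change)[seq_len(n)]

net_change  # [1]  0  1  1  0 -1  1 -2
coverage    # [1] 0 1 2 2 1 2
```

Position 6 ends up with a count of 2 because it lies in both 3..6 and 6..6, and position 5 has a count of 1 because only 3..6 covers it.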

I tried this approach using sparse vectors to cut down on memory usage. It may well be faster with ordinary dense vectors, but I have not checked. With sparseVector it was 5.7x faster than the loop approach for the given example.

library(Matrix)

set.seed(123)

starts <- sample(10^6, replace = TRUE)
ends <- starts + sample(100:1000, length(starts), replace = TRUE)

v.cov <- NULL
fun1 <- function() {
  coverage <- integer(max(ends))
  for(i in seq(length(starts))) {
    coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1 
  }
  v.cov <<- coverage
}
# Testing "for loop" approach
system.time(fun1())
# user  system elapsed 
# 21.84    0.00   21.83 

v.sum <- NULL
fun2 <- function() {      
  # 1. Aggregate the number of ranges that start at each index
  t.starts <- table(starts)
  i.starts <- strtoi(names(t.starts))
  x.starts <- as.vector(t.starts)
  sv.starts <- sparseVector(x=x.starts, i=i.starts, length=max(ends)+1)  # to match length of sv.ends below
  # 2. Aggregate the number of ranges that end at one position before each index
  t.ends <- table(ends)
  i.ends <- strtoi(names(t.ends))+1  # because "ends" are inclusive 
  x.ends <- as.vector(t.ends)
  sv.ends <- sparseVector(x=x.ends, i=i.ends, length=max(ends)+1)

  sv.diff <- sv.starts - sv.ends
  v.sum <<- cumsum(sv.diff)[1:max(ends)]  # drop last element
}
# Testing "cumulative sum" approach
system.time(fun2())
# user  system elapsed 
# 3.828   0.000   3.823

identical(v.cov, v.sum)
# TRUE

Also, there is probably a better way to extract the x's and i's for the sparseVector constructor than using table and strtoi(names(x)), which may boost speed further.
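
One possible way to get the same x's and i's without the round trip through character names is to count with tabulate and keep the non-zero positions. This is a sketch on a smaller stand-in vector (5000 values, an assumption to keep it quick), and it has not been timed against the table version.

```r
set.seed(123)
starts <- sample(5000, replace = TRUE)

# Counts per position, computed directly on the integer values.
cnt <- tabulate(starts)

i.starts <- which(cnt > 0L)  # positions where at least one range starts
x.starts <- cnt[i.starts]    # how many ranges start at each of them

# Same result as the table() / strtoi(names()) route used in fun2.
t.starts <- table(starts)
identical(i.starts, strtoi(names(t.starts)))
identical(x.starts, as.vector(t.starts))
```

Note that tabulate allocates a dense count vector of length max(starts), so this trades some of the memory saving of the sparse approach for simpler code.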

EDIT

Avoid strtoi by using a 1-column sparseMatrix instead:

v.sum.mat <- NULL
fun3 <- function() {
  v.ones <- rep(1, length(starts))
  m.starts <- sparseMatrix(i=starts, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
  m.ends <- sparseMatrix(i=ends+1, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
  m.diff <- m.starts - m.ends
  v.sum.mat <<- cumsum(m.diff[,1])[1:max(ends)]
}
# Testing "cumulative sum" approach using matrix
system.time(fun3())
#   user  system elapsed 
#  0.456   0.028   0.486 

identical(v.cov, v.sum.mat)
# TRUE
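
fun3 needs no explicit counting step because sparseMatrix adds up the x values of entries that share the same (i, j) pair, so passing a 1 for every range yields the per-index counts directly. A minimal sketch with made-up indices, assuming the Matrix package is installed:

```r
library(Matrix)

# Row 2 is given twice, row 5 once; all entries go in column 1.
m <- sparseMatrix(i = c(2, 2, 5), j = c(1, 1, 1), x = c(1, 1, 1),
                  dims = c(6, 1))

# Duplicated (i, j) pairs are summed, so row 2 holds a count of 2.
counts <- m[, 1]
counts
# [1] 0 2 0 0 1 0
```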

EDIT 2 - super fast, super short

Based on a comment by @alexis_laz, thank you!

fun4 <- function() {
  cumsum(tabulate(starts, max(ends) + 1L) - tabulate(ends + 1L, max(ends) + 1L))[1:max(ends)]
}
system.time(v.sum.tab <- fun4())
# user  system elapsed 
# 0.040   0.000   0.041 

identical(as.integer(v.cov), v.sum.tab)
# TRUE
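
The as.integer() in the last comparison is needed because the two results have different storage types. The loop adds the double constant 1 to an integer vector, which promotes it to double, while tabulate and cumsum stay in integer. A small illustration:

```r
coverage <- integer(3)
type_before <- typeof(coverage)       # "integer"

coverage <- coverage + 1              # 1 is a double constant
type_after <- typeof(coverage)        # "double"

# Adding the integer literal 1L keeps the integer type.
type_literal <- typeof(integer(3) + 1L)                    # "integer"

# tabulate and cumsum on integers both return integers.
type_tab <- typeof(cumsum(tabulate(c(1L, 3L))))            # "integer"

# identical() is strict about type, so equal values can still differ.
same_strict <- identical(c(1, 2), c(1L, 2L))               # FALSE
same_cast   <- identical(as.integer(c(1, 2)), c(1L, 2L))   # TRUE
```

Writing + 1L in the loop would make the plain identical() call succeed without the conversion.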
