Speed up simple R code (vectorize?)
I have two positive integer vectors specifying start and end "positions" of ranges:
starts <- sample(10^6,replace = T)
ends <- starts+sample(100:1000,length(starts),replace=T)
So these specify 1,000,000 ranges that are 100 to 1000 units long. Now I want to know how many times each position (a positive integer) is "covered" by a range. For this I do:
coverage <- integer(max(ends))
for(i in seq(length(starts))) {
coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1
}
But because of the for loop, it's relatively slow. For billions of ranges, it can take a very long time. I cannot find a way to vectorize this code. I could split the work and use multiple CPUs, but the speed gain would be marginal. apply, lapply and other meta-functions do not improve speed (as expected). For instance,
coverage <- tabulate(unlist(Map(':', starts,ends)))
is also slow because of the "Map" part. I fear it also takes more memory.
Any ideas?
You could keep a count of the ranges that start and end at each index, and then take a cumulative sum over the difference of these counts.
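1. Aggregate the number of ranges that start at each index.
2. Aggregate the number of ranges that end at one position before each index (if ends are inclusive).
3. Calculate the net change at each index: count of starts - count of ends.
4. Take the cumulative sum of the net changes.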
The "covered" number is equal to this cumulative sum at each index.
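To make the idea concrete, here is a minimal sketch on a toy input. The three ranges and the names toy.starts, toy.ends and toy.cov are made up for illustration, and it uses base R's tabulate for the counting rather than the sparse vectors used in the timed code that follows.

```r
# Toy input (hypothetical): three inclusive ranges 2..4, 3..6 and 5..5
toy.starts <- c(2L, 3L, 5L)
toy.ends   <- c(4L, 6L, 5L)
n <- max(toy.ends)

# Net change per index: +1 where a range starts, -1 one position past where it ends
delta <- tabulate(toy.starts, nbins = n + 1L) - tabulate(toy.ends + 1L, nbins = n + 1L)
# delta is 0 1 1 0 0 -1 -1

# The running total of net changes is the number of ranges open at each index
toy.cov <- cumsum(delta)[seq_len(n)]
print(toy.cov)
# 0 1 2 2 2 1
```

Position 5, for example, is covered by the ranges 3..6 and 5..5, which matches the value 2 at index 5.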
I tried this approach using sparse vectors to cut down on memory usage, although it may be faster with normal vectors; I am not sure. With sparseVector it was 5.7x faster than the loop approach for the given example.
library(Matrix)
set.seed(123)
starts <- sample(10^6,replace = T)
ends <- starts+sample(100:1000,length(starts),replace=T)
v.cov <- NULL
fun1 <- function() {
coverage <- integer(max(ends))
for(i in seq(length(starts))) {
coverage[starts[i]:ends[i]] <- coverage[starts[i]:ends[i]] + 1
}
v.cov <<- coverage
}
# Testing "for loop" approach
system.time(fun1())
# user system elapsed
# 21.84 0.00 21.83
v.sum <- NULL
fun2 <- function() {
# 1. Aggregate the number of ranges that start at each index
t.starts <- table(starts)
i.starts <- strtoi(names(t.starts))
x.starts <- as.vector(t.starts)
sv.starts <- sparseVector(x=x.starts, i=i.starts, length=max(ends)+1) # to match length of sv.ends below
# 2. Aggregate the number of ranges that end at one position before each index
t.ends <- table(ends)
i.ends <- strtoi(names(t.ends))+1 # because "ends" are inclusive
x.ends <- as.vector(t.ends)
sv.ends <- sparseVector(x=x.ends, i=i.ends, length=max(ends)+1)
sv.diff <- sv.starts - sv.ends
v.sum <<- cumsum(sv.diff)[1:max(ends)] # drop last element
}
# Testing "cumulative sum" approach
system.time(fun2())
# user system elapsed
# 3.828 0.000 3.823
identical(v.cov, v.sum)
# TRUE
Also, there is probably a better way to extract the x's and i's for the sparseVector constructor than using table and strtoi(names(x)), which may boost speed further.
EDIT
Avoid strtoi by using a 1-column sparseMatrix instead
v.sum.mat <- NULL
fun3 <- function() {
v.ones <- rep(1, length(starts))
m.starts <- sparseMatrix(i=starts, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
m.ends <- sparseMatrix(i=ends+1, j=v.ones, x=v.ones, dims=c(max(ends)+1,1))
m.diff <- m.starts - m.ends
v.sum.mat <<- cumsum(m.diff[,1])[1:max(ends)]
}
# Testing "cumulative sum" approach using matrix
system.time(fun3())
# user system elapsed
# 0.456 0.028 0.486
identical(v.cov, v.sum.mat)
# TRUE
EDIT 2 - super fast, super short
Based on a comment by @alexis_laz, thank you!
fun4 <- function() {
cumsum(tabulate(starts, max(ends) + 1L) - tabulate(ends + 1L, max(ends) + 1L))[1:max(ends)]
}
system.time(v.sum.tab <- fun4())
# user system elapsed
# 0.040 0.000 0.041
identical(as.integer(v.cov), v.sum.tab)
# TRUE
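If the same computation is needed for other inputs, it could be wrapped in a small function. The following is only a sketch: the name range.coverage and the input checks are my own additions, not part of the answer above, and it assumes inclusive ends and a non-empty input. It compares the result with the plain loop on a small random example.

```r
# Hypothetical helper: name and input checks are illustrative assumptions
range.coverage <- function(starts, ends) {
  stopifnot(length(starts) == length(ends), all(starts >= 1), all(ends >= starts))
  n <- max(ends)
  # +1 at each start, -1 one position past each (inclusive) end
  delta <- tabulate(starts, nbins = n + 1L) - tabulate(ends + 1L, nbins = n + 1L)
  cumsum(delta)[seq_len(n)]
}

# Compare with the straightforward loop on a small random input
set.seed(1)
s <- sample(50, 20, replace = TRUE)
e <- s + sample(0:10, 20, replace = TRUE)
brute <- integer(max(e))
for (i in seq_along(s)) brute[s[i]:e[i]] <- brute[s[i]:e[i]] + 1L
identical(range.coverage(s, e), brute)
```

The loop here adds 1L instead of 1 so that brute stays an integer vector, which is why no as.integer() conversion is needed before calling identical().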