Efficient dataframe iteration in R
Suppose I have a 5 million row data frame with two columns, like this (the example below has only eight rows for simplicity):
df <- data.frame(start=c(11,21,31,41,42,54,61,63), end=c(20,30,40,50,51,63,70,72))
I want to be able to produce the following numbers in a numeric vector:
11 to 20, 21 to 30, 31 to 40, 41 to 50, 51, 54-63, 64-70, 71-72
And then take the length of the new vector (in this case, 10+10+10+10+1+10+7+2 = 60).
*NOTE: I do not need the vector itself; just its length will suffice. So if someone has a smarter way to obtain the length directly, that is welcome.
Essentially, for each row of the data frame I take the sequence from start to end, combine all of these sequences, and then keep only the UNIQUE values.
So I used an approach like this:
length(unique(c(apply(df, 1, function(x) {
return(as.numeric(x[1]):as.numeric(x[2]))
}))))
which proves incredibly slow on my five million row data frame.
Are there any quicker, more efficient solutions? As a bonus, please try to include system time.
   user  system elapsed
 19.946   0.620  20.477
This should work, assuming your data is sorted.
library(dplyr) # for the lag function
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
#[1] 60
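The one-liner above relies on the rows being sorted, and it compares each row only with the row directly before it. As a rough sketch (not part of the original answer), the same idea can be written in base R so that it sorts the input itself and tracks the running maximum of `end`, which also covers a range that lies entirely inside an earlier one. The helper name `covered_length` is made up for illustration, and the sketch assumes integer-valued bounds with `start <= end` in every row.

```r
# Hypothetical helper (not from the original answers): counts the distinct
# integers covered by the closed ranges [start, end].
# Assumes integer-valued bounds and start <= end in every row.
covered_length <- function(start, end) {
  if (length(start) == 0L) return(0)
  ord <- order(start, end)          # sort by start, so no pre-sorting is needed
  s <- start[ord]
  e <- end[ord]
  # Largest end seen in any earlier row (-Inf for the first row)
  prev_max <- c(-Inf, head(cummax(e), -1))
  # Each row contributes only the integers beyond what is already covered;
  # rows fully contained in earlier ranges contribute 0 rather than a negative
  sum(pmax(0, e - pmax(s, prev_max + 1) + 1))
}

df <- data.frame(start=c(11,21,31,41,42,54,61,63), end=c(20,30,40,50,51,63,70,72))
covered_length(df$start, df$end)
## [1] 60
```

On the example data this gives the same 60 as the one-liner. The extra `order()` and `cummax()` calls cost some time, so if the data is already sorted and no range is contained in an earlier one, the shorter version is probably enough.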
library(microbenchmark)
mm <- as.matrix(df)  # matrix used by the other answer; converted outside the benchmark
microbenchmark(
beginneR={with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))},
r2evans={vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))); sum(mm[,2]-vec+1);},
times = 1000
)
Unit: microseconds
     expr    min      lq median      uq      max neval
 beginneR 37.398 41.4455 42.731 44.0795   74.349  1000
  r2evans 31.788 35.2470 36.827 38.3925 9298.669  1000
So the matrix version is still faster, but not by much (and the conversion step is still not included here). I also wonder why the maximum duration for @r2evans's answer is so high compared to all the other values (which are really fast).
Another method:
mm <- as.matrix(df) ## critical for performance/scalability
(vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))))
## [1] 11 21 31 41 51 54 64 71
sum(mm[,2] - vec + 1)
## [1] 60
(This should scale reasonably well, certainly better than data.frames.)
Edit: after I updated my code to use matrices and no apply calls, I did a quick benchmark of my implementation compared with the other answer (which is also correct):
library(microbenchmark)
library(dplyr)
microbenchmark(
beginneR={
df <- data.frame(start=c(11,21,31,41,42,54,61,63),
end=c(20,30,40,50,51,63,70,72))
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
},
r2evans={
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
}
)
## Unit: microseconds
##      expr     min      lq   median      uq     max neval
##  beginneR 230.410 238.297 244.9015 261.228 443.574   100
##   r2evans  37.791  40.725  44.7620  47.880 147.124   100
This benefits greatly from the use of matrices instead of data.frames.
Oh, and system time is not that helpful here :-)
system.time({
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
})
##    user  system elapsed
##       0       0       0
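With only eight rows, system.time rounds to zero, so it says little about the 5 million row case from the question. Here is a rough sketch of timing the same matrix code on a larger synthetic input. The data generator is made up for illustration (strictly increasing starts, every range 10 integers wide, so the rows are sorted as both answers assume), and `n` is set to one million to keep the run short.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 1e6       # assumed size for the sketch; raise towards 5e6 if memory allows

# Synthetic, already-sorted ranges (an assumption, not the asker's real data)
start <- cumsum(sample(1:10, n, replace = TRUE))  # strictly increasing starts
end   <- start + 9                                # each range spans 10 integers
mm <- cbind(start, end)

timing <- system.time({
  vec <- pmax(mm[, 1], c(0, 1 + head(mm[, 2], n = -1)))
  total <- sum(mm[, 2] - vec + 1)
})
print(timing)

# Sanity check against the brute-force definition, on the first 1000 rows only
k <- 1000
brute <- length(unique(unlist(Map(seq, start[1:k], end[1:k]))))
fast  <- sum(end[1:k] - pmax(start[1:k], c(0, 1 + head(end[1:k], -1))) + 1)
stopifnot(brute == fast)
```

The timings will vary by machine, so none are shown here. The brute-force check is limited to a small prefix because building every sequence is exactly the slow step the question is trying to avoid.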