Efficient dataframe iteration in R

Suppose I have a 5 million row data frame with two columns, like this (for simplicity, this sample has only eight rows):

df <- data.frame(start=c(11,21,31,41,42,54,61,63), end=c(20,30,40,50,51,63,70,72))

I want to be able to produce the following numbers in a numeric vector:

11 to 20, 21 to 30, 31 to 40, 41 to 50, 51, 54-63, 64-70, 71-72

And then take the length of the new vector (in this case, 10+10+10+10+1+10+7+2 = 60).

*NOTE: I do not need the vector itself; just its length will suffice. So if someone has a smarter approach to obtain the length, that is welcome.

Essentially, for each row in the data frame, the sequence from start to end is taken, all these sequences are combined, and the result is filtered for UNIQUE values.

So I used an approach like this:

length(unique(c(apply(df, 1, function(x) {
    return(as.numeric(x[1]):as.numeric(x[2]))
}))))
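
One thing to watch with the `apply` approach: when the rows produce sequences of different lengths, `apply` returns a list rather than a matrix, and `unique` on a list compares whole elements instead of individual numbers. The sample data happens to yield ten values for every row, which hides this. Below is a minimal sketch of the same brute-force idea that flattens explicitly with `unlist`; the helper name `brute_force_count` is made up for illustration and is not part of the original post.

```r
df <- data.frame(start = c(11, 21, 31, 41, 42, 54, 61, 63),
                 end   = c(20, 30, 40, 50, 51, 63, 70, 72))

# Hypothetical helper: expand every interval, flatten, count distinct values.
# Time and memory still grow with the total covered length, so this is only
# a baseline for checking faster approaches, not a fix for the slowness.
brute_force_count <- function(start, end) {
  length(unique(unlist(Map(seq, start, end))))
}

brute_force_count(df$start, df$end)
# [1] 60
```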

which proves incredibly slow on my five million row data frame.

Any quicker, more efficient solutions? Bonus: please try to include the system time.

   user  system elapsed
 19.946   0.620  20.477

This should work, assuming your data is sorted.

library(dplyr)  # for the lag function

with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
#[1] 60
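
The one-liner above relies on the rows being sorted and on each `end` being no smaller than the previous one; a later interval that lies entirely inside an earlier one would contribute a negative term. As a hedged sketch (not from the original answer), the same idea can be made tolerant of unsorted or nested intervals by sorting on `start`, tracking the running maximum of `end` with `cummax`, and clamping each contribution at zero. The name `covered_length` is hypothetical, and only base R is used.

```r
# Hypothetical helper: count distinct integers covered by closed intervals
# [start, end], without assuming sorted input or increasing ends.
covered_length <- function(start, end) {
  if (length(start) == 0L) return(0)
  o <- order(start)
  start <- start[o]
  end <- end[o]
  # Furthest end seen among all earlier intervals (-Inf before the first).
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  # Each interval contributes only what lies beyond the earlier coverage;
  # pmax(0, ...) zeroes out intervals that are fully contained.
  sum(pmax(0, end - pmax(start, prev_max_end + 1) + 1))
}

covered_length(c(11, 21, 31, 41, 42, 54, 61, 63),
               c(20, 30, 40, 50, 51, 63, 70, 72))
# [1] 60
covered_length(c(1, 2, 5), c(10, 3, 12))  # [2, 3] is nested inside [1, 10]
# [1] 12
```

On input that is already sorted with non-decreasing ends, this reduces to the same arithmetic as the original expression, at the cost of an extra `order` call.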

library(microbenchmark)
mm <- as.matrix(df)  # matrix used by @r2evans's answer; conversion is not timed
microbenchmark(
  beginneR={with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))},
  r2evans={vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))); sum(mm[,2]-vec+1);},
  times = 1000
)

Unit: microseconds
     expr     min       lq  median       uq       max neval
beginneR   37.398  41.4455  42.731  44.0795    74.349  1000
r2evans    31.788  35.2470  36.827  38.3925  9298.669  1000

So the matrix is still faster, but not by much (and the conversion step is still not included here). I also wonder why the max duration for @r2evans's answer is so high compared to all the other values (which are really fast).

Another method:

mm <- as.matrix(df) ## critical for performance/scalability
(vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))))
##  [1] 11 21 31 41 51 54 64 71
sum(mm[,2] - vec + 1)
##  [1] 60

(This should scale reasonably well, certainly better than data.frames.)

Edit: after I updated my code to use matrices and no apply calls, I did a quick benchmark of my implementation compared with the other answer (which is also correct):

library(microbenchmark)
library(dplyr)
microbenchmark(
    beginneR={
        df <- data.frame(start=c(11,21,31,41,42,54,61,63),
                         end=c(20,30,40,50,51,63,70,72))
        with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
    },
    r2evans={
        mm <- matrix(c(11,21,31,41,42,54,61,63,
                       20,30,40,50,51,63,70,72), ncol=2)
        vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
        sum(mm[,2]-vec+1)
    }
    )
##  Unit: microseconds
##       expr     min      lq   median      uq     max neval
##   beginneR 230.410 238.297 244.9015 261.228 443.574   100
##    r2evans  37.791  40.725  44.7620  47.880 147.124   100

This benefits greatly from the use of matrices instead of data.frames.

Oh, and system time is not that helpful here :-)

system.time({
    mm <- matrix(c(11,21,31,41,42,54,61,63,
                   20,30,40,50,51,63,70,72), ncol=2)
    vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
    sum(mm[,2]-vec+1)
})
##     user  system elapsed 
##        0       0       0 
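
The zeros appear because eight rows finish faster than the timer can resolve. To get a meaningful `system.time`, the same computation can be run on a larger input. The sketch below does this on synthetic data: the intervals are made up (random steps between starts, a fixed width of 10), `n` is set to one million rather than five million to keep it quick, and no timings are quoted because they depend on the machine.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 1e6                         # raise to 5e6 to mimic the question's size

# Synthetic, made-up intervals: sorted starts, ends that never decrease.
start <- cumsum(sample(1:10, n, replace = TRUE))
end   <- start + 9               # fixed width of 10, as in the sample data
mm    <- cbind(start, end)

timing <- system.time({
  vec   <- pmax(mm[, 1], c(0, 1 + head(mm[, 2], n = -1)))
  total <- sum(mm[, 2] - vec + 1)
})
timing                           # user / system / elapsed, machine dependent
total                            # number of distinct integers covered
```

Because the steps never exceed the interval width, the synthetic intervals leave no gaps, so `total` should equal `end[n] - start[1] + 1`, which gives a cheap correctness check at this scale.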
