
R for loop faster than sapply

Whenever I replace a for loop with an apply statement, my R scripts run faster, but here's an exception. I'm still inexperienced in using the apply family correctly, so what can I do to the apply statements to make them outperform (i.e., run faster than) the for loop?

Example data:

vc<-as.character(c("120,129,129,114","103,67,67,67,67,10,10,10,12","2,1,1,1,2,4,3,1,1,1,3,2,1,1","1,3,1,1,1,1,1,4",NA,"5","1,1,99","2,2,2,16,11,11,11,11,11,29,29,26,26,26,26,26,26,26,26,26,26,31,24,29,29,29,29,40,24,23,3,3,3,6,6,4,5,4,4,3,3,4,4,6,8,8,6,6,6,5,3,3,4,4,5,5,4,4,4,4,6,11,10,11,10,14,2,2,22,22,22,22,24,24,24,23,24,24,24,23,24,23,23,23,24,25,27,27,24,24,26,24,25,25,24,25,26,29,31,32,32,32,32,33,32,35,35,35,52,44,37,26","20,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,1,1,1,12,10","67,63,73,70,75,135,94,94,96,94,95,96,96,97,94,94,94,94,24,24,24,24,24,24,24,24,24,24,24,1,1,1"))

The goal is to populate a numeric matrix m.res where each row contains the top 3 values of the corresponding element in vc. Here's the for loop:

fx.test1 <- function(vc)
     {
     m.res<-matrix(ncol=3, nrow=length(vc))
     for (j in seq_along(vc))
      {vn<-as.numeric(unlist(strsplit(vc[j], split=",")))
      vn[is.na(vn)]<-0; vn2<-rev(sort(vn))
      m.res[j,]<-vn2[1:3]   # NA-padded when fewer than 3 values
      }
     return(m.res)   # without this the function returns NULL
     }

And below is my "apply solution". Why is it slower? How can I make it faster? Thank you!

fx.test2 <- function(vc)
    {
    m.res<-matrix(ncol=3, nrow=length(vc))
    vc[is.na(vc)]<-"0"
    ls.vc<-sapply(vc, function(x) tail(sort(as.numeric(unlist(strsplit(x, split=",")))),3), simplify=TRUE)
    #names(ls.vc)<-seq(1:length(vc))
    # pad short results with zeros so every element has length 3
    ls.vc2<-lapply(ls.vc, function(x) c(as.numeric(x), rep(0, times = 3 - length(x))))
    m.res<-as.matrix(t(as.data.frame(ls.vc2)))   # use the padded list, not ls.vc
    return(m.res)
}

system.time(m.res<-fx.test1(vc))
#   user  system elapsed 
#  0.001   0.000   0.001 

system.time(m.res<-fx.test2(vc))
#   user  system elapsed 
#  0.003   0.000   0.003

UPDATE: I followed the suggestions of @John and generated two trimmed and truly equivalent functions. Indeed, I was able to speed up the lapply function somewhat, but it's still SLOWER than the for loop. If you happen to have any ideas for how to optimize these functions for speed, please let me know. Thank you all.

fx.test3<-function(vc) 
{
    L<-strsplit(vc,split=",")
    m.res<-matrix(ncol=3, nrow=length(vc))
    for (j in 1:length(vc)) 
        {
        m.res[j,]<-sort(c(as.numeric(L[[j]]),rep(0,3)), decreasing=TRUE)[1:3]
    }
    return(m.res)
}



fx.test4<-function(vc) 
    {
        L<-strsplit(vc, split=",")
        D<-t(as.data.frame(lapply(L, function(X) {sort(c(as.numeric(X),rep(0,3)),decreasing=TRUE)[1:3]})))
        row.names(D)<-NULL
        m.res<-as.matrix(D)
        return(m.res)
    }

system.time(fx.test3(vc))
#   user  system elapsed 
#  0.001   0.000   0.001

system.time(fx.test4(vc))
#   user  system elapsed 
#  0.002   0.000   0.002 

UPDATE2 & potential answer:

I have now simplified fx.test4 as follows, and it is now equivalent in speed to the for loop. So it was the extra conversion steps that made the lapply solution slower, as @John pointed out. In addition, the assumption that *apply HAD to be faster may have been faulty, as discussed by @Ari B. Friedman and @SimonO101. Thank you all!

fx.test5<-function(vc) 
    {
        L<-strsplit(vc, split=",")
        m.res<-t(sapply(seq_along(L), function(X){sort(c(as.numeric(L[[X]]),rep(0,3)),decreasing=TRUE)[1:3]}))
        return(m.res)
    }

fx.test5(vc)
      [,1] [,2] [,3]
 [1,]  129  129  120
 [2,]  103   67   67
 [3,]    4    3    3
 [4,]    4    3    1
 [5,]    0    0    0
 [6,]    5    0    0
 [7,]   99    1    1
 [8,]   52   44   40
 [9,]   20   19   19
[10,]  135   97   96

system.time(fx.test5(vc))
   user  system elapsed 
  0.001   0.000   0.001 

UPDATE3: Indeed, on a longer example, the *apply function is faster (by a hair).

system.time(fx.test3(vc2))
#   user  system elapsed 
#  3.596   0.006   3.601 
system.time(fx.test5(vc2))
#   user  system elapsed 
#  3.355   0.006   3.359
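
The construction of vc2 is not shown above. As a rough sketch, one hypothetical way to build a longer input for this kind of comparison is to repeat a few of the example strings; the name vc_long and the repetition count below are assumptions for illustration, not the data behind the timings above.

```r
# A few of the example strings, including the NA and single-value cases.
vc <- c("120,129,129,114", "103,67,67,67,67,10,10,10,12", NA, "5", "1,1,99")

# fx.test3 and fx.test5 as defined in the question.
fx.test3 <- function(vc) {
  L <- strsplit(vc, split = ",")
  m.res <- matrix(ncol = 3, nrow = length(vc))
  for (j in seq_along(vc)) {
    m.res[j, ] <- sort(c(as.numeric(L[[j]]), rep(0, 3)), decreasing = TRUE)[1:3]
  }
  m.res
}

fx.test5 <- function(vc) {
  L <- strsplit(vc, split = ",")
  t(sapply(seq_along(L), function(X) {
    sort(c(as.numeric(L[[X]]), rep(0, 3)), decreasing = TRUE)[1:3]
  }))
}

# Hypothetical longer input: the short vector repeated many times.
vc_long <- rep(vc, times = 2000)

system.time(r3 <- fx.test3(vc_long))
system.time(r5 <- fx.test5(vc_long))

# Both versions should produce the same matrix.
all.equal(r3, r5)
```

Timings will vary by machine and R version, so the gap between the two may be larger, smaller, or reversed on another setup.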

Your problem can be solved using the concat.split function from the splitstackshape package:

library(splitstackshape)
kk<-data.frame(vc)
nn<-concat.split(kk,split.col="vc",sep=",")
head(nn[1:10,1:4])
                           vc vc_1 vc_2 vc_3
1             120,129,129,114  120  129  129
2 103,67,67,67,67,10,10,10,12  103   67   67
3 2,1,1,1,2,4,3,1,1,1,3,2,1,1    2    1    1
4             1,3,1,1,1,1,1,4    1    3    1
5                        <NA>   NA   NA   NA
6                           5    5   NA   NA

You can then manipulate the nn data frame to get the columns with the largest values.
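
One possible way to do that last step in base R is sketched below. To keep the example independent of splitstackshape, it uses a small hand-built stand-in called nn_demo (a hypothetical name) holding only the first four split columns of the first six rows, so for longer rows it is not the true top 3 of the full string.

```r
# Stand-in for the numeric columns of nn: first four values of rows 1-6 of vc.
nn_demo <- data.frame(
  vc_1 = c(120, 103, 2, 1, NA, 5),
  vc_2 = c(129,  67, 1, 3, NA, NA),
  vc_3 = c(129,  67, 1, 1, NA, NA),
  vc_4 = c(114,  67, 1, 1, NA, NA)
)

# For each row: treat missing values as 0, sort descending, keep the first three.
top3 <- t(apply(as.matrix(nn_demo), 1, function(r) {
  r <- unname(r)        # drop column names so they don't leak into the result
  r[is.na(r)] <- 0
  sort(r, decreasing = TRUE)[1:3]
}))

top3
```

With the real nn, the same apply call would be run on all of the vc_* columns rather than on this four-column excerpt.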

You're doing a lot of work inside your loops, whether apply or for, that doesn't need to be there. The main benefit of apply is not so much that it is faster than for, but that it encourages a style that keeps things vectorized as much as possible (i.e., with as little as possible inside your loops). What R is particularly slow at is interpreting function calls, and each time through the loop it has to interpret every function call it encounters. Sometimes loops are unavoidable, but their bodies should be kept as small as possible.
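
As a small illustration of that per-call overhead, the sketch below converts the same strings to numbers once element by element and once with a single vectorized call. The input x and the helper names loop_convert and vec_convert are made up for this example.

```r
# Hypothetical input: 100,000 short numeric strings.
x <- as.character(seq_len(1e5))

# Element by element: as.numeric() is called once per element.
loop_convert <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- as.numeric(x[i])
  out
}

# Vectorized: as.numeric() is called once on the whole vector.
vec_convert <- function(x) as.numeric(x)

system.time(a <- loop_convert(x))
system.time(b <- vec_convert(x))

# Same result either way; only the number of function calls differs.
identical(a, b)
```

The exact timings depend on the machine and R version, but the vectorized call typically finishes much sooner because the interpreter handles one call instead of 100,000.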

Your strsplit call can be moved outside the first sapply, so that it is called only once. You then also don't need unlist before as.numeric. You can also sort with decreasing = TRUE and take the first three elements instead of additionally calling tail (although tail may be about as fast as a [1:3] selector). All of that saves function calls that would otherwise be interpreted over and over inside your loop.
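
A minimal sketch of those three changes, using a shortened version of the example vector and assuming that missing or short entries should be padded with zeros as in the question:

```r
# Shortened example input, including an NA and a single-value entry.
vc <- c("120,129,129,114", "5", NA, "1,1,99")

# strsplit() is vectorized, so one call splits every element.
parts <- strsplit(vc, split = ",")

top3 <- lapply(parts, function(p) {
  v <- as.numeric(p)     # p is already a character vector; no unlist() needed
  v[is.na(v)] <- 0       # treat a missing entry as 0
  # Pad with zeros so there are always at least three values, then
  # sort descending and index instead of calling tail().
  sort(c(v, 0, 0, 0), decreasing = TRUE)[1:3]
})

top3
```

Each list element is now a length-3 numeric vector, so only the cheap conversion and sort remain inside the loop.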

You don't have to pre-allocate your matrix, because you're going to generate the values all at once and then shape them into a matrix.
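
One way this could look is with vapply, which returns the results already arranged as a matrix. This is a sketch of the idea rather than the exact code the advice had in mind, and it reuses a shortened example vector.

```r
vc <- c("120,129,129,114", "5", NA, "1,1,99")
parts <- strsplit(vc, split = ",")

# vapply() checks that every result is numeric of length 3 and returns
# a 3 x length(vc) matrix with one column per input element.
cols <- vapply(parts, function(p) {
  v <- as.numeric(p)
  v[is.na(v)] <- 0
  sort(c(v, 0, 0, 0), decreasing = TRUE)[1:3]
}, numeric(3))

# Transpose so that each row corresponds to one element of vc.
m.res <- t(cols)
m.res
```

No matrix is created up front; the only reshaping step is the final transpose.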

See if following that advice speeds things up.
