How to speed up nested for loop in R for large data, which currently uses append in it and outputs large lists? How to vectorise?

Hopefully I get this right this time around. I posted once before (years ago) and remember that my question lacked a good example and detail.

I'm using the quakes dataset in R for this example, which should make it easier to follow.

I have a function myfunc:


    myfunc <- function(x,y){
      z <- (x - y)^2
      return(z)
    }

What I'm trying to do is apply this function to each and every row of the quakes dataset. For example, using the head of the dataset:

> library(datasets)
> data(quakes)
> head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12
> 

The first row would be passed to myfunc together with every other row in the dataset, then the same would happen for the second row, and so on.

I'm currently using the following nested for loop and appending to a vector for each column. I then cbind them all together.

lat <- vector()
long <- vector()
depth <- vector()
mag <- vector()
stations <- vector()
for (i in 1:6){
  for (j in 1:6){
    lat <- append(lat,(myfunc(quakes$lat[i], quakes$lat[j])))
    long <- append(long,(myfunc(quakes$long[i], quakes$long[j])))
    depth <- append(depth,(myfunc(quakes$depth[i], quakes$depth[j])))
    mag <- append(mag,(myfunc(quakes$mag[i], quakes$mag[j])))
    stations <- append(stations,(myfunc(quakes$stations[i], quakes$stations[j])))
  }
}
final <- as.data.frame(cbind(lat, long, depth, mag, stations))
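
One way to get the same `final` without `append` is to build the row indices for every pair up front and call `myfunc` once per column, since `(x - y)^2` already works on whole vectors. This is a minimal sketch, assuming the same first six rows as the loop above; `rows`, `i`, `j` and `final_vec` are names made up for this illustration.

```r
library(datasets)
data(quakes)

myfunc <- function(x, y) (x - y)^2

# Assumption: same 6-row subset as the loop above.
rows <- 1:6
n <- length(rows)

# Index vectors that visit the pairs in the same order as the nested loop:
# i changes slowest (outer loop), j changes fastest (inner loop).
i <- rep(rows, each = n)
j <- rep(rows, times = n)

# myfunc is vectorised, so each column is handled in a single call.
final_vec <- as.data.frame(lapply(quakes, function(col) myfunc(col[i], col[j])))
```

For all rows, `rows <- seq_len(nrow(quakes))` should work the same way. On data with 1244 rows this gives 1244^2 = 1,547,536 result rows per column, so memory use is worth keeping an eye on.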

The actual data I'm running this on has 1244 rows and 13 columns, and the full code never seems to finish (or takes too long: I usually stop it when it nears 1 hour). I have tried the same code on 191 rows and that runs fine, usually within 1 minute.

I've read up online about this and it's clear that using append inside for loops is not a good idea. I've come across sapply, vectorisation and some examples. However, I'm really struggling to get this to work and to output exactly the same as it does currently.
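
The usual explanation for the slowdown is that every `append` call builds a new vector and copies everything collected so far, so the work grows much faster than the number of results. If the loop itself has to stay, allocating the result once and filling it by position avoids that copying. Here is a sketch for a single column; the counter `k` is an illustrative name.

```r
myfunc <- function(x, y) (x - y)^2

x <- datasets::quakes$lat[1:6]   # same 6-row subset as in the question
n <- length(x)

# Allocate the full result once instead of growing it with append().
lat <- numeric(n * n)
k <- 0
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    k <- k + 1
    lat[k] <- myfunc(x[i], x[j])   # write into the preallocated slot
  }
}
```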

I was wondering whether anyone can help me out with this or has useful advice?

Thank you.

Update: Just to add that I'm going to use the cbind function to bind two columns onto the results. For example, if the quakes data had a letter assigned to each row, i.e. A, B, C, I would want the final output after the cbind to go from this

 ID    lat   long depth mag stations
1 A -20.42 181.62   562 4.8       41
2 B -20.62 181.03   650 4.2       15
3 C -26.00 184.10    42 5.4       43
4 D -17.97 181.66   626 4.1       19
5 E -20.42 181.96   649 4.0       11
6 F -19.68 184.31   195 4.0       12

to

 ID1 ID2    lat   long depth mag stations
1  A   A  (row from final)
2  A   B  (row from final)
3  A   C  (row from final)
4  B   A  (row from final)
5  B   B  (row from final)
6  B   C  (row from final)

etc.

Currently I'm using something similar to this:

ID1 <- vector()
ID2 <- vector()
for (i in 1:1244){
  for (j in 1:1244){
    ID1 <- append(ID1,quakes$ID[i])
    ID2 <- append(ID2,quakes$ID[j])
  }
}

It currently returns large character lists. Do you have any suggestions on how this could be improved?
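
The ID pairs can probably be built with two `rep()` calls instead of a loop. This is a sketch only: quakes has no `ID` column, so letters are attached here as a hypothetical stand-in, and `q` and `ids` are illustrative names.

```r
# Hypothetical ID column: quakes has none, so letters are attached for illustration.
q <- head(datasets::quakes)
q$ID <- LETTERS[seq_len(nrow(q))]

n <- nrow(q)
ID1 <- rep(q$ID, each = n)    # each ID repeated n times in a row (outer loop index)
ID2 <- rep(q$ID, times = n)   # the whole ID sequence repeated n times (inner loop index)
ids <- data.frame(ID1, ID2, stringsAsFactors = FALSE)
```

The rows of `ids` follow the same order as the nested loop, so `cbind(ids, final)` should line up with a `final` built from the same rows.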

Apologies for not mentioning this in my original post.

Here are two functions.
The first is my original answer turned into a function. According to a comment, it is already faster than the code in the question, but the second function is around twice as fast as the first. It is also more memory efficient.

myfunc <- function(x, y){
  z <- (x - y)^2
  return(z)
}


slower <- function(X, fun = myfunc){
  fun <- match.fun(fun)
  res <- sapply(X, function(x) {
    o <- outer(x, x, fun)
    o[row(o) != col(o)]
  })
  as.data.frame(res)
}

faster <- function(X, fun){
  f <- function(x, fun = myfunc){
    y <- lapply(seq_along(x), function(i){
      fun(x[i], x[-i])
    })
    unlist(y)
  }
  fun <- match.fun(fun)
  res <- sapply(X, f, fun = fun)
  as.data.frame(res)
}

Test both; the results are identical.

res1 <- slower(quakes, myfunc)
res2 <- faster(quakes, myfunc)
identical(res1, res2)
#[1] TRUE
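
Note that both functions drop the comparison of a row with itself, so each column has n * (n - 1) values rather than the n^2 produced by the loop in the question. Matching ID columns for that layout could be sketched as follows. The `ID` letters are hypothetical, and the ordering follows `faster`; it also fits `slower` here only because `myfunc` is symmetric in its arguments.

```r
myfunc <- function(x, y) (x - y)^2

# Hypothetical ID column for illustration; quakes itself has none.
q <- head(datasets::quakes)
q$ID <- LETTERS[seq_len(nrow(q))]
n <- nrow(q)

# Element i is paired with every other element, with i changing slowest.
ID1 <- rep(q$ID, each = n - 1)
ID2 <- unlist(lapply(seq_len(n), function(i) q$ID[-i]))

# The same per-column computation that faster() performs, shown for lat only.
lat <- unlist(lapply(seq_len(n), function(i) myfunc(q$lat[i], q$lat[-i])))
pairs <- data.frame(ID1, ID2, lat, stringsAsFactors = FALSE)
```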

Now for the timings with package microbenchmark.

library(microbenchmark)

mb <- microbenchmark(
  outer = slower(quakes, myfunc),
  fastr = faster(quakes, myfunc),
  times = 10
)
print(mb, unit = "relative", order = "median")
#Unit: relative
#  expr      min       lq     mean   median       uq      max neval cld
# fastr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 
# outer 1.545283 1.650968 1.970562 2.159856 2.762724 1.332896    10   b


ggplot2::autoplot(mb)
