R: Speed up a for loop on a very large data frame?

I have a huge set of coordinates with associated Z values. Some coordinate pairs are repeated several times with different Z values. I want to obtain the mean of all Z values for each unique pair of coordinates.

I wrote a short piece of code that works perfectly well on a small data frame. The problem is that my actual data frame has more than 2 million rows, and the computation takes more than 10 hours to complete. I was wondering whether there is a way to make it more efficient and reduce the computation time.

Here is what my df looks like:

> df
           x        y         Z                                 xy
1  -54.60417 4.845833 0.3272980 -54.6041666666667/4.84583333333333
2  -54.59583 4.845833 0.4401644 -54.5958333333333/4.84583333333333
3  -54.58750 4.845833 0.5788663          -54.5875/4.84583333333333
4  -54.57917 4.845833 0.6611844 -54.5791666666667/4.84583333333333
5  -54.57083 4.845833 0.7830828 -54.5708333333333/4.84583333333333
6  -54.56250 4.845833 0.8340629          -54.5625/4.84583333333333
7  -54.55417 4.845833 0.8373666 -54.5541666666667/4.84583333333333
8  -54.54583 4.845833 0.8290986 -54.5458333333333/4.84583333333333
9  -54.57917 4.845833 0.9535526 -54.5791666666667/4.84583333333333
10 -54.59583 4.837500 0.0000000           -54.5958333333333/4.8375
11 -54.58750 4.845833 0.8582580          -54.5875/4.84583333333333
12 -54.58750 4.845833 0.3857006          -54.5875/4.84583333333333

You can see that some xy coordinates are the same (e.g. rows 3, 11 and 12, or rows 4 and 9), and I want the mean Z value of all these identical coordinates. So here is my script:

mean<-vector(mode = "numeric",length = length(df$x))

for (i in 1:length(df$x)){
  mean(df$Z[which(df$xy==df$xy[i])])->mean[i]
} 
mean->df$mean
df<-df[,-(3:4)]
df<-unique(df)

And I get something like this:

> df
           x        y      mean
1  -54.60417 4.845833 0.3272980
2  -54.59583 4.845833 0.4401644
3  -54.58750 4.845833 0.6076083
4  -54.57917 4.845833 0.8073685
5  -54.57083 4.845833 0.7830828
6  -54.56250 4.845833 0.8340629
7  -54.55417 4.845833 0.8373666
8  -54.54583 4.845833 0.8290986
10 -54.59583 4.837500 0.0000000

That does the job, but surely there is a way to speed up this process (probably without the for loop) for a df with a much larger number of rows?

Welcome! In future it would be best to give us a quick way to copy and paste some code that generates the essential features of the dataset you're working with. Here is an example, I think:

DF <- data.frame(x = sample(c(-54.1, -54.2), size = 10, replace = TRUE),
                 y = sample(c(4.8, 4.4), size = 10, replace = TRUE),
                 z = runif(10))

This looks like a straightforward split-apply-combine problem:

set.seed(1)
df <- data.frame(x = sample(c(-54.1, -54.2), size = 10, replace = TRUE),
                 y = sample(c(4.8, 4.4), size = 10, replace = TRUE),
                 z = runif(10))

library(data.table)
DT <- as.data.table(df)
DT[, .(mean_z = mean(z)), keyby = c("x", "y")]
#>        x   y    mean_z
#> 1: -54.2 4.4 0.3491507
#> 2: -54.2 4.8 0.4604533
#> 3: -54.1 4.4 0.3037848
#> 4: -54.1 4.8 0.5734239

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df %>%
  group_by(x, y) %>%
  summarise(mean_z = mean(z))
#> # A tibble: 4 x 3
#> # Groups:   x [?]
#>       x     y mean_z
#>   <dbl> <dbl>  <dbl>
#> 1 -54.2   4.4  0.349
#> 2 -54.2   4.8  0.460
#> 3 -54.1   4.4  0.304
#> 4 -54.1   4.8  0.573

Created on 2018-09-21 by the reprex package (v0.2.1)
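If adding a package is not an option, base R's aggregate() can do the same grouping. The following is only a minimal sketch: it assumes the columns are named x, y and Z as in the question, and the rows are retyped from the printed output there, so the values are rounded. The result name `res` and the renamed column `mean` are choices made for this illustration.

```r
# Sample rows copied from the question (values as printed, so rounded)
df <- data.frame(
  x = c(-54.60417, -54.59583, -54.58750, -54.57917, -54.57083, -54.56250,
        -54.55417, -54.54583, -54.57917, -54.59583, -54.58750, -54.58750),
  y = c(rep(4.845833, 9), 4.837500, 4.845833, 4.845833),
  Z = c(0.3272980, 0.4401644, 0.5788663, 0.6611844, 0.7830828, 0.8340629,
        0.8373666, 0.8290986, 0.9535526, 0.0000000, 0.8582580, 0.3857006)
)

# Group by every distinct (x, y) pair and take the mean of Z within each group
res <- aggregate(Z ~ x + y, data = df, FUN = mean)

# Rename the aggregated column to match the output shown in the question
names(res)[names(res) == "Z"] <- "mean"
res
```

This removes the need for the pasted xy key and the row-by-row loop, since the grouping is done once over the whole data frame. On millions of rows aggregate() is likely to be slower than the data.table or dplyr versions, but it should still be far quicker than the loop.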

You could try dplyr::summarise.

library(dplyr)
df %>%
  group_by(x, y) %>%
  summarise(meanZ = mean(Z))

I'd guess this would take less than a minute, depending on your machine.

Someone else might provide a data.table answer, which may be even quicker.
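A timing guess like this is easy to check on your own machine. Here is a rough sketch that builds a synthetic data frame of about the size mentioned in the question and times one grouped mean with system.time(). The 200 x 200 coordinate grid, the seed, and the names `sim`, `res` and `timing` are all assumptions for illustration, not the asker's data. It uses base R's aggregate() so it runs without extra packages.

```r
set.seed(1)
n <- 2e6  # roughly the row count mentioned in the question

# Hypothetical data: coordinates drawn from a 200 x 200 grid so that pairs repeat
sim <- data.frame(
  x = sample(seq(-55, -54, length.out = 200), n, replace = TRUE),
  y = sample(seq(4, 5, length.out = 200), n, replace = TRUE),
  Z = runif(n)
)

# Time one grouped mean over all rows (base R here; swap in the dplyr
# pipeline above to time that instead)
timing <- system.time(
  res <- aggregate(Z ~ x + y, data = sim, FUN = mean)
)

timing["elapsed"]  # seconds; depends heavily on the machine
nrow(res)          # number of distinct (x, y) pairs, at most 200 * 200
```

The elapsed time will vary with hardware and with how many distinct coordinate pairs the real data contains, so treat the number as a rough indication only.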
