
Speed-up a parallel process calculating a Mann-Kendall test over a huge dataset in R

Let's assume we have a large dataset of climatic data at monthly time steps for a large number of points in the world. The dataset is shaped as a data.frame of the type:

lon, lat, data_month_1_yr_1, ..., data_month_12_yr_100

Example:

set.seed(123)
data<- data.frame(cbind(runif(10000,-180,180), runif(10000,-90,90))
, replicate(1200, runif(10000,0,150)))

I would like to perform a Mann-Kendall test (using trend::mk.test) over the monthly time series of each of the spatial points and collect the main statistics in a data.frame. In order to speed up this very long process I parallelized my code and wrote something like the following:

coords<-data[,1:2] #get the coordinates out of the initial dataset
names(coords)<-c("lon","lat") 
data_t<- as.data.frame(t(data[,3:1202])) #each column is now the time series associated to a point
data_t$month<-rep(seq(1,12,1),100) # month index as last column of the data frame
# start the parallel processing

library(foreach)
library(doParallel)

cores=detectCores() #count cores
cl <- makeCluster(cores[1]-1) #take all the cores minus 1 not to overload the pc
registerDoParallel(cl)

mk_out<- foreach(m=1:12, .combine = rbind) %:%
         foreach (a =1:10000, .combine = rbind) %dopar% {

           data_m<-data_t[which(data_t$month==m),]
           library(trend) #need to load this all the times otherwise I get an error (don't know why)
           test<-mk.test(data_m[,a])
           mk_out_temp <- data.frame("lon"=coords[a,1],
                                     "lat"=coords[a,2],
                                     "p.value" = as.numeric(test$p.value),
                                     "z_stat" = as.numeric(test$statistic),
                                     "tau" = as.numeric(test$estimates[3]),
                                     "month"= as.numeric(m))
           mk_out_temp
}
stopCluster(cl)

head(mk_out)
         lon       lat    p.value     z_stat         tau month
1  -76.47209 -34.09350 0.57759040 -0.5569078 -0.03797980     1
2  103.78985 -31.58639 0.64436238  0.4616081  0.03151515     1
3  -32.76831  66.64575 0.11793238  1.5635113  0.10626263     1
4  137.88627 -30.83872 0.79096910  0.2650524  0.01818182     1
5  158.56822 -67.37378 0.09595919 -1.6647673 -0.11313131     1
6 -163.59966 -25.88014 0.82325630  0.2233588  0.01535354     1

This runs just fine and gives me exactly what I am after: a matrix reporting the MK statistics for each combination of coordinates and month. Even though the process is parallelized, the computation still takes a considerable amount of time.

Is there a way to speed up this process? Any room for using functions from the apply family?

You note that you have already fixed your problem. A further speed-up is obtainable using one of the following steps:

1: Copy the necessary packages and objects to the foreach loop using .packages and .export. This ensures that each instance will not clash when trying to access the same memory (a minimal sketch follows after this list).

2: Utilizing high-performance libraries such as tidyverse or data.table to perform the subsetting and computation.
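
For step 1, here is a minimal sketch of what the question's nested loop looks like with the package and objects declared up front via .packages and .export (same variable names as in the question; untimed, so treat it as an illustration rather than a benchmark):

# sketch for step 1: .packages loads trend on every worker and .export copies
# data_t and coords to them, replacing the library() call inside the loop body
mk_out <- foreach(m = 1:12, .combine = rbind) %:%
  foreach(a = 1:10000, .combine = rbind,
          .packages = "trend",
          .export = c("data_t", "coords")) %dopar% {
    data_m <- data_t[which(data_t$month == m), ]
    test <- mk.test(data_m[, a])
    data.frame(lon = coords[a, 1], lat = coords[a, 2],
               p.value = as.numeric(test$p.value),
               z_stat = as.numeric(test$statistic),
               tau = as.numeric(test$estimates[3]),
               month = as.numeric(m))
  }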

The latter is a bit more complicated but yielded a massive boost to performance on my tiny tiny laptop (performing all the calculations in roughly 1.5 minutes for the entire dataset).

Below is my added code. Note that I replaced foreach with a single parLapply function from the parallel package.

set.seed(123)
data<- data.frame(cbind(runif(10000,-180,180), runif(10000,-90,90))
                  , replicate(1200, runif(10000,0,150)))

coords<-data[,1:2] #get the coordinates out of the initial dataset
names(coords)<-c("lon","lat") 
data_t<- as.data.frame(t(data[,3:1202])) #each column is now the time series associated to a point
data_t$month<-rep(seq(1,12,1),100) # month index as last column of the data frame
# start the parallel processing

library(data.table)
library(parallel)
library(trend)
setDT(data_t)
setDT(coords)
cores=detectCores() #count cores
cl <- makeCluster(cores[1]-1) #take all the cores minus 1 not to overload the pc

#user  system elapsed 
#17.80   35.12   98.72
system.time({
  test <- data_t[, parLapply(cl, .SD, function(x){
            unlist(trend::mk.test(x)[c("p.value","statistic","estimates")])
          }), by = month] #Perform the calculations across each month
  #create a column that indicates what each row is measuring
  rows <- rep(c("p.value","statistic.z","estimates.S","estimates.var","estimates.tau"),12)

  final_tests <- dcast( #Cast the melted structure to a nice form
                      melt(cbind(test,rowname = rows), #Melt the data for a better structure
                        id.vars = c("rowname","month"), #Grouping variables
                        measure.vars = paste0("V",seq.int(1,10000))), #variable names
                      month + variable ~ rowname, #LHS groups the data along rows, RHS decides the value columns
                      value.var = "value", #Which column contain values? 
                      drop = TRUE) #should we drop unused columns? (doesn't matter here)
  #rename the columns as desired
  names(final_tests) <- c("month","variable","S","tau","var","p.value","z_stat")
  #finally add the coordinates
  final_tests <- cbind(final_tests,coords) 
})
stopCluster(cl) #release the workers when done
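
The melt/dcast step above can be hard to parse at first, so here is the same pattern on a tiny hypothetical table (two statistics for two points in a single month; the values just echo the first rows of the question's output):

# toy illustration of the melt + dcast reshaping used above
library(data.table)
toy <- data.table(month = c(1, 1), rowname = c("p.value", "statistic.z"),
                  V1 = c(0.58, -0.56), V2 = c(0.64, 0.46))
molten <- melt(toy, id.vars = c("rowname", "month"),
               measure.vars = c("V1", "V2"))
dcast(molten, month + variable ~ rowname, value.var = "value")
#    month variable p.value statistic.z
# 1:     1       V1    0.58       -0.56
# 2:     1       V2    0.64        0.46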

At the end the problem was easily addressed by replacing the second loop with a lapply function (inspired by this answer). The execution time is now down to just a few seconds. Vectorizing remains the best solution to execution times in R (see this post and this).

I share the final code below for reference:

set.seed(123)
data<- data.frame(cbind(runif(10000,-180,180), runif(10000,-90,90)), replicate(1200, runif(10000,0,150)))
coords<-data[,1:2]
names(coords)<-c("lon","lat")
data_t<- as.data.frame(t(data[,3:1202]))
data_t$month<-rep(seq(1,12,1),100)


library(foreach)
library(doParallel)

cores=detectCores()
cl <- makeCluster(cores[1]-1) #take all the cores minus 1
registerDoParallel(cl)

mk_out<- foreach(m=1:12, .combine = rbind) %dopar% {
    data_m<-data_t[which(data_t$month==m),] #subset the series for month m
    library(trend) #load the package on each worker
    mk_out_temp <- do.call(rbind,lapply(data_m[1:10000],function(x)unlist(mk.test(x)))) #mk.test over every point column (all but the trailing month index)
    mk_out_temp <-cbind(coords,mk_out_temp,rep(m,dim(coords)[1])) #attach the coordinates and the month index
    mk_out_temp
  }
stopCluster(cl)


head(mk_out)
         lon       lat data.name            p.value        statistic.z null.value.S parameter.n estimates.S estimates.varS
1  -76.47209 -34.09350         x  0.577590398263635 -0.556907839290681            0         100        -188         112750
2  103.78985 -31.58639         x  0.644362383361713  0.461608102085858            0         100         156         112750
3  -32.76831  66.64575         x  0.117932376736468   1.56351131351662            0         100         526         112750
4  137.88627 -30.83872         x   0.79096910003836  0.265052394100912            0         100          90         112750
5  158.56822 -67.37378         x 0.0959591933285242  -1.66476728429674            0         100        -560         112750
6 -163.59966 -25.88014         x  0.823256299016955  0.223358759073802            0         100          76         112750
       estimates.tau alternative                  method              pvalg rep(m, dim(coords)[1])
1 -0.037979797979798   two.sided Mann-Kendall trend test  0.577590398263635                      1
2 0.0315151515151515   two.sided Mann-Kendall trend test  0.644362383361713                      1
3  0.106262626262626   two.sided Mann-Kendall trend test  0.117932376736468                      1
4 0.0181818181818182   two.sided Mann-Kendall trend test   0.79096910003836                      1
5 -0.113131313131313   two.sided Mann-Kendall trend test 0.0959591933285242                      1
6 0.0153535353535354   two.sided Mann-Kendall trend test  0.823256299016955                      1
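
One caveat of the unlist() approach visible above: unlist() coerces the mixed statistics to a common type, so every statistic column of mk_out comes out as character. A small post-processing sketch (an addition of mine, assuming the column names shown in the output above) restores the numeric types and renames the trailing month column:

# unlist() coerced all statistics to character; convert the numeric columns
# back and give the auto-generated month column a clean name
num_cols <- c("p.value", "statistic.z", "estimates.S",
              "estimates.varS", "estimates.tau")
mk_out[num_cols] <- lapply(mk_out[num_cols], as.numeric)
names(mk_out)[ncol(mk_out)] <- "month"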
