How to use doParallel for calculating distance between zipcodes in R?

I have a large dataset (2.6M rows) with two zip codes per row and their corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the geosphere package to calculate the Vincenty ellipsoid distance between the zip codes, but it takes a massive amount of time on my dataset. What would be a fast way to implement this?

What I tried

library(tidyverse)
library(geosphere)

zipdata <- select(fulldata,originlat,originlong,destlat,destlong)

## Very basic approach
for(i in seq_len(nrow(zipdata))){
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
       c(zipdata$destlat[i],zipdata$destlong[i]),
       fun=distVincentyEllipsoid)
}

## Tidyverse approach 
zipdata <- zipdata%>%
 mutate(dist2 = distm(cbind(originlat,originlong), cbind(destlat,destlong), 
   fun = distHaversine))

Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on smaller test data without any luck:

library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)

test <- select(head(fulldata,n=1000),originlat,originlong,destlat,destlong)

foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
       c(zipdata$destlat[i],zipdata$destlong[i]),
       fun=distVincentyEllipsoid) 
}
stopCluster(cl)

Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?

Edit: Benchmarks from (some) replies

## benchmark
library(dplyr)           # sample_n()
library(geosphere)       # distHaversine()
library(geodist)         # geodist()
library(microbenchmark)
zipsamp <- sample_n(zip, size = 1000000)
microbenchmark(
  dave = {
    # Dave2e
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong, zipsamp$patlat),
                                   cbind(zipsamp$faclong, zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "cheap")
  },
  unit = "s", times = 100)

# Unit: seconds
#      expr        min         lq       mean     median         uq        max neval  cld
#      dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100    d
#    geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100  b
#    geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c
#  geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100 a

A simple all.equal test showed that, for my dataset, the haversine results are equal to the vincenty results, but differ from the "cheap" method of the geodist package with a "Mean relative difference: 0.01002573".
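For readers unfamiliar with how all.equal reports such a result, here is a minimal base R sketch. The vectors `a` and `b` are made-up stand-ins for two distance columns, not values from the dataset above.

```r
a <- c(100, 200, 300)   # made-up reference distances
b <- a * 1.01           # the same distances, each 1% larger

# The default tolerance is about 1.5e-8, so a 1% gap is reported as a
# difference: the result is a character string, not TRUE
res_default <- all.equal(a, b)
res_default

# With a looser tolerance the two vectors count as equal
res_loose <- all.equal(a, b, tolerance = 0.02)
res_loose
```

Because all.equal returns either TRUE or a descriptive string, wrap it in isTRUE() when the result is used in a condition.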

R is a vectorized language, so the function operates over all of the elements in the vectors at once. Since you are calculating the distance between the origin and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000 times faster than the loop.
Using distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should also improve the performance.

Without any sample data this snippet is untested.

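library(dplyr)
library(geosphere)

zipdata <- select(fulldata, originlat, originlong, destlat, destlong)

## Vectorized approach: one call over all rows
## cbind() builds the two-column (longitude, latitude) matrices geosphere expects
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
                                       cbind(zipdata$destlong, zipdata$destlat))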

Note: For most of the geosphere functions to work correctly, the proper order is longitude first, then latitude.

The reason the tidyverse approach listed above is slow is that the distm function calculates the distance between every origin and every destination, which would result in a 2 million by 2 million element matrix.
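The difference between the all-pairs matrix and the row-wise result can be seen on a tiny example. This is a sketch with three made-up coordinate pairs (not rows from the question's data), using distHaversine for both calls.

```r
library(geosphere)

# Three made-up origin/destination pairs, longitude first, then latitude
origin <- cbind(lon = c(-73.99, -87.63, -122.42), lat = c(40.73, 41.88, 37.77))
dest   <- cbind(lon = c(-71.06, -93.27, -118.24), lat = c(42.36, 44.98, 34.05))

# distm() compares every origin with every destination: a 3 x 3 matrix
all_pairs <- distm(origin, dest, fun = distHaversine)
dim(all_pairs)

# Calling the distance function directly returns one value per row
row_wise <- distHaversine(origin, dest)
length(row_wise)

# The row-wise distances are the diagonal of the full matrix
all.equal(diag(all_pairs), row_wise)
```

With n rows, distm computes n * n distances to obtain the n that are actually wanted, which is why it does not scale to millions of rows.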

If you are going to use geosphere, I would use either a fast approximate method like distHaversine, or the still fast and very precise distGeo method. (The distVincenty* functions are mainly implemented for curiosity.)
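If geosphere is kept and the work should still be spread over several cores with doParallel, as the question asks, one option is to parallelize over blocks of rows instead of single rows. The foreach loop in the question cannot work as written: it indexes zipdata rather than test, and assignments made inside %dopar% happen in each worker's own copy of the data and are discarded. Results have to be returned from the loop body and combined. The following is a minimal sketch under those assumptions; the random `test` data frame and the choice of 2 cores are made up for illustration.

```r
library(doParallel)  # also attaches foreach and parallel
library(geosphere)

# Made-up stand-in for the question's `test` data frame
set.seed(1)
n <- 1000
test <- data.frame(
  originlong = runif(n, -120, -75), originlat = runif(n, 30, 45),
  destlong   = runif(n, -120, -75), destlat   = runif(n, 30, 45)
)

cores <- 2  # assumed value; adjust to the machine
cl <- makeCluster(cores)
registerDoParallel(cl)

# One contiguous block of row indices per worker
chunks <- parallel::splitIndices(nrow(test), cores)

# Each task makes one vectorized call on its block of rows and returns a
# vector; .combine = c concatenates the pieces in chunk order
test$dist1 <- foreach(idx = chunks, .combine = c,
                      .packages = "geosphere") %dopar% {
  distGeo(cbind(test$originlong[idx], test$originlat[idx]),
          cbind(test$destlong[idx],  test$destlat[idx]))
}

stopCluster(cl)
```

Whether this beats a single vectorized call depends on the data size, since starting workers and copying data to them has its own cost. It is worth timing both before committing to the parallel version.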

I used @SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my datasets. I found it to be significantly faster than the geosphere package in every test (I have added one of them to my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances below 100 km. See the geodist vignette for more information. Given that some of my distances were greater than 100 km, I settled on using the Vincenty Ellipsoid measure.
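To check how the cheap ruler behaves on a particular dataset, the relative difference against another measure can be computed per pair. This is a sketch with two made-up pairs, one short and one long. The coordinates are not from the dataset above, and no specific error values are claimed here.

```r
library(geodist)

# Made-up coordinates; geodist identifies the columns by name
origin <- cbind(lon = c(-73.99, -87.63), lat = c(40.73, 41.88))
dest   <- cbind(lon = c(-74.17, -93.27), lat = c(40.69, 44.98))

# paired = TRUE gives one distance per row instead of an all-pairs matrix
d_cheap <- geodist(origin, dest, paired = TRUE, measure = "cheap")
d_hav   <- geodist(origin, dest, paired = TRUE, measure = "haversine")

# Relative difference of the cheap ruler against haversine, per pair
rel_diff <- abs(d_cheap - d_hav) / d_hav
rel_diff
```

Running the same comparison on a sample of the real rows shows whether the cheap measure is acceptable before it is used on the full data.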
