How to use doParallel for calculating distance between zipcodes in R?
I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the geosphere package to calculate the Vincenty ellipsoid distance between the zip codes, but it is taking a massive amount of time on my dataset. What would be a fast way to implement this?
What I tried
library(tidyverse)
library(geosphere)
zipdata <- select(fulldata, originlat, originlong, destlat, destlong)
## Very basic approach
for (i in seq_len(nrow(zipdata))) {
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
## Tidyverse approach
zipdata <- zipdata %>%
  mutate(dist2 = distm(cbind(originlat, originlong), cbind(destlat, destlong),
                       fun = distHaversine))
Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on smaller test data, without any luck:
library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)
test <- select(head(fulldata, n = 1000), originlat, originlong, destlat, destlong)
foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
stopCluster(cl)
Can anyone help me out with either the correct way to use doParallel with geosphere, or a better way to handle this?
Edit: Benchmarks from (some) replies
## benchmark
library(dplyr)           # sample_n
library(geosphere)       # distHaversine
library(geodist)         # geodist
library(microbenchmark)
zipsamp <- sample_n(zip, size = 1000000)
microbenchmark(
  dave = {
    # Dave2e
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong, zipsamp$patlat),
                                   cbind(zipsamp$faclong, zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "cheap")
  },
  unit = "s", times = 100)
# Unit: seconds
#      expr        min         lq       mean     median         uq        max neval cld
#      dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100   d
#    geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100   b
#    geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c
#  geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100   a
A simple all.equal test showed that, for my dataset, the haversine method is equal to the Vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.
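For readers unfamiliar with how all.equal reports such a difference, here is a minimal base R sketch. The vectors `a`, `b`, and `d` are made-up values, not the dataset from the question. all.equal compares numeric vectors up to a tolerance (about 1.5e-8 by default), so "equal" here means "equal within that tolerance".

```r
# Made-up distances in metres, for illustration only
a <- c(1000, 2000, 3000)
b <- a * (1 + 1e-10)  # differs by far less than the default tolerance
d <- a * 1.01         # differs by about 1 percent

# Within tolerance: all.equal returns TRUE
same <- all.equal(a, b)

# Outside tolerance: all.equal returns a character message that
# describes the mean relative difference instead of FALSE
msg <- all.equal(a, d)
print(msg)

# Loosening the tolerance makes the same comparison pass
loose <- all.equal(a, d, tolerance = 0.02)
```

Because the failing case returns a string rather than FALSE, wrap the call in isTRUE() when using it inside an if condition.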
R is a vectorized language, so the function will operate over all of the elements in the vectors. Since you are calculating the distance between the origin and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000x faster than the loop.
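To make the loop-versus-vectorized point concrete, here is a small base R sketch. It uses a hand-written haversine helper (`haversine_m`, a hypothetical name) and random toy coordinates rather than geosphere or the real data, so it runs without extra packages. The actual speedup depends on the machine and the distance function, so no timing is claimed.

```r
# Haversine great-circle distance in metres; inputs in degrees.
# The radius 6378137 m is the same default that geosphere::distHaversine uses.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Toy data (assumption): random points, column names as in the question
set.seed(1)
n <- 10000
toy <- data.frame(
  originlong = runif(n, -120, -75), originlat = runif(n, 25, 48),
  destlong   = runif(n, -120, -75), destlat   = runif(n, 25, 48)
)

# Row-by-row loop: one function call per row
loop_dist <- numeric(n)
for (i in seq_len(n)) {
  loop_dist[i] <- haversine_m(toy$originlong[i], toy$originlat[i],
                              toy$destlong[i], toy$destlat[i])
}

# Vectorized: a single call over whole columns
vec_dist <- haversine_m(toy$originlong, toy$originlat,
                        toy$destlong, toy$destlat)

# Both approaches give the same distances
stopifnot(isTRUE(all.equal(loop_dist, vec_dist)))
```

Wrapping each variant in system.time() is a quick way to compare them on your own hardware.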
Also, using distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should improve the performance.
Without any sample data, this snippet is untested.
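library(dplyr)      # select
library(geosphere)
zipdata <- select(fulldata, originlat, originlong, destlat, destlong)
## Vectorized approach: one call over all rows.
## cbind() builds two-column (longitude, latitude) matrices; c() would
## concatenate the columns into a single long vector instead.
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
                                       cbind(zipdata$destlong, zipdata$destlat))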
Note: For most of the geosphere functions to work correctly, the proper order is longitude first, then latitude.
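If parallelism is still wanted on top of vectorization, one possible shape for a doParallel version is sketched below. It is an illustration, not a tested recommendation: the toy data, the `chunks` helper, and the choice of 2 workers are all assumptions. Two things differ from the attempt in the question. First, assignments made inside a %dopar% body only change the worker's copy of the data, so the distances must be returned as the value of the block and combined with .combine. Second, each task handles a whole chunk of rows with one vectorized call, which avoids paying the scheduling overhead once per row.

```r
library(foreach)
library(doParallel)

# Toy data (assumption): random points, column names as in the question
set.seed(42)
n <- 1000
toy <- data.frame(
  originlong = runif(n, -120, -75), originlat = runif(n, 25, 48),
  destlong   = runif(n, -120, -75), destlat   = runif(n, 25, 48)
)

n_workers <- 2  # illustrative; choose to suit your machine
cl <- parallel::makeCluster(n_workers)
registerDoParallel(cl)

# One contiguous block of row indices per worker
chunks <- split(seq_len(n), cut(seq_len(n), breaks = n_workers, labels = FALSE))

# Each task returns the distances for its chunk; .combine = c
# concatenates the chunk results in their original order
toy$dist <- foreach(idx = chunks, .combine = c,
                    .packages = "geosphere") %dopar% {
  distHaversine(cbind(toy$originlong[idx], toy$originlat[idx]),
                cbind(toy$destlong[idx], toy$destlat[idx]))
}

stopCluster(cl)
```

For a function as cheap as distHaversine, the cost of starting workers and copying data may outweigh the gain, so it is worth timing this against the plain vectorized call before adopting it.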
The reason the tidyverse approach listed above is slow is that the distm function calculates the distance between every origin and every destination, which would result in a 2 million by 2 million element matrix.
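A tiny example may make the all-pairs behaviour of distm easier to see. The three origin and destination points below are made-up coordinates in (longitude, latitude) order. The paired distances are just the diagonal of the matrix that distm builds, so every off-diagonal entry is wasted work for this problem.

```r
library(geosphere)

# Made-up points, longitude first, then latitude
origins <- cbind(lon = c(-73.99, -87.63, -118.24),
                 lat = c(40.73, 41.88, 34.05))
dests   <- cbind(lon = c(-71.06, -95.37, -122.42),
                 lat = c(42.36, 29.76, 37.77))

# distm: every origin against every destination, a 3 x 3 matrix here
all_pairs <- distm(origins, dests, fun = distHaversine)
print(dim(all_pairs))

# Calling the distance function directly: row i against row i only
paired <- distHaversine(origins, dests)

# The paired result is the diagonal of the all-pairs matrix
stopifnot(isTRUE(all.equal(diag(all_pairs), paired)))

# Rough size of the full matrix for 2.1M rows at 8 bytes per double,
# in terabytes
full_matrix_tb <- 2.1e6^2 * 8 / 1e12
print(full_matrix_tb)
```

At roughly 35 TB, the full matrix for 2.1M rows would not fit in memory on ordinary hardware.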
If you are going to use geosphere, I would either use a fast approximate method like distHaversine, or the still fast and very precise distGeo method. (The distVincenty* functions are mainly implemented for curiosity.)
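As a quick way to see how far the two recommended functions diverge, the sketch below compares distHaversine and distGeo on two made-up point pairs. Both calls use the package defaults. The size of the gap on real data depends on latitude and distance, so treat this only as a way to check it yourself.

```r
library(geosphere)

# Made-up point pairs, longitude first, then latitude
p1 <- cbind(lon = c(-73.99, -87.63), lat = c(40.73, 41.88))
p2 <- cbind(lon = c(-71.06, -95.37), lat = c(42.36, 29.76))

d_hav <- distHaversine(p1, p2)  # spherical approximation, metres
d_geo <- distGeo(p1, p2)        # geodesic on the WGS84 ellipsoid, metres

# Relative difference of the approximation against the geodesic result
rel_diff <- abs(d_hav - d_geo) / d_geo
print(rel_diff)
```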
I used @SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my datasets. I found it to be significantly faster than the geosphere package in every test (I have added one of them to my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances below 100 km. See the geodist vignette for more information. Given that some of my distances were greater than 100 km, I settled on using the Vincenty ellipsoid measure.
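For reference, a minimal sketch of the paired geodist call with the three measures discussed here. The `from` and `to` data frames are made-up points, one short pair and one long pair. geodist may print a message when measure = "cheap" is used on distances above 100 km.

```r
library(geodist)

# Made-up points; geodist identifies the columns by their lon/lat names
from <- data.frame(lon = c(-73.99, -73.99), lat = c(40.73, 40.73))
to   <- data.frame(lon = c(-73.95, -71.06), lat = c(40.78, 42.36))

# paired = TRUE: row i of `from` against row i of `to`, result in metres
d_hav   <- geodist(from, to, paired = TRUE, measure = "haversine")
d_vin   <- geodist(from, to, paired = TRUE, measure = "vincenty")
d_cheap <- geodist(from, to, paired = TRUE, measure = "cheap")

# Side-by-side view to judge how much the measures disagree per row
print(data.frame(haversine = d_hav, vincenty = d_vin, cheap = d_cheap))
```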