
R Find the Distance between Two US Zipcode columns

I was wondering what the most efficient method of calculating the distance in miles between two US zipcode columns would be using R.

I have heard of the geosphere package for computing the distance between zipcodes but do not fully understand it, and was wondering if there were alternative methods as well.

For example, say I have a data frame that looks like this.

 ZIP_START     ZIP_END
 95051         98053
 94534         94128
 60193         60666
 94591         73344
 94128         94128
 94015         73344
 94553         94128
 10994         7105
 95008         94128

I want to create a new data frame that looks like this.

 ZIP_START     ZIP_END     MILES_DIFFERENCE
 95051         98053       x
 94534         94128       x
 60193         60666       x
 94591         73344       x
 94128         94128       x
 94015         73344       x
 94553         94128       x
 10994         7105        x
 95008         94128       x

Where x is the distance in miles between the two zipcodes.

What is the best method of calculating this distance?

Here is the R code to create the example data frame.

df <- data.frame("ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008), "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, 7105, 94128))

Please let me know if you have any questions.

Any advice is appreciated.

Thank you for your help.

There is a handy R package named "zipcode" which provides a table of zip codes along with the city, state, latitude and longitude. Once you have that information, the "geosphere" package can calculate the distance between points.

library(zipcode)
library(geosphere)

# the zip code columns need to be character vectors, or else the leading zeros will be dropped, causing errors
df <- data.frame("ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008), 
       "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, "07105", 94128), 
       stringsAsFactors = FALSE)

data("zipcode")

df$distance_meters<-apply(df, 1, function(x){
  startindex<-which(x[["ZIP_START"]]==zipcode$zip)
  endindex<-which(x[["ZIP_END"]]==zipcode$zip)
  distGeo(p1=c(zipcode[startindex, "longitude"], zipcode[startindex, "latitude"]), p2=c(zipcode[endindex, "longitude"], zipcode[endindex, "latitude"]))
})

A warning about the column classes of your input data frame: zip codes should be stored as character, not numeric, otherwise leading zeros are dropped, causing errors.
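If the zip codes have already been read in as numbers, as in the question's data frame, one way to restore the leading zeros is to zero-pad them back to five digits. This is only a small sketch; `zip_num` and `zip_chr` are made-up names for illustration.

```r
# Zip codes stored as numbers lose their leading zero: "07105" becomes 7105
zip_num <- c(95051, 7105)

# Zero-pad back to five characters; as.integer() makes the "%05d" format safe
zip_chr <- sprintf("%05d", as.integer(zip_num))
zip_chr
```

Reading the column as character in the first place is still the safer route, since padding assumes every value is a plain five-digit zip code.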

The distance returned by distGeo is in meters; I will leave the conversion to miles to the reader.
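For readers who want that conversion spelled out: one international mile is exactly 1609.344 meters, so dividing by that constant is enough. A minimal sketch follows; the `distance_meters` values are made up for illustration and stand in for the column computed above.

```r
# One international mile is exactly 1609.344 meters
meters_per_mile <- 1609.344

# Hypothetical distances in meters, standing in for df$distance_meters
distance_meters <- c(0, 1609.344, 16093.44)

# Vectorised conversion to miles
distance_miles <- distance_meters / meters_per_mile
distance_miles
```

On the real data frame this would be something like `df$MILES_DIFFERENCE <- df$distance_meters / 1609.344`.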

Update
The zipcode package appears to have been archived. There is a replacement package, "zipcodeR", which provides the longitude and latitude data along with additional information.

As Dave2e mentioned, the original zipcode package has already been removed from CRAN, so we need to use zipcodeR instead.

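# install the packages if they are missing, then load them
if (!requireNamespace("zipcodeR", quietly = TRUE)) install.packages("zipcodeR")
if (!requireNamespace("geosphere", quietly = TRUE)) install.packages("geosphere")
library(zipcodeR)
library(geosphere)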

df <- data.frame(
  "ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008),
  "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, "07105", 94128),
  stringsAsFactors = FALSE
)

data("zip_code_db")

df$distance_meters<-apply(df, 1, function(x){
  startindex<-which(x[["ZIP_START"]]==zip_code_db$zipcode)
  endindex<-which(x[["ZIP_END"]]==zip_code_db$zipcode)
  distGeo(p1=c(zip_code_db[startindex, "lng"], 
               zip_code_db[startindex, "lat"]), 
          p2=c(zip_code_db[endindex, "lng"], 
               zip_code_db[endindex, "lat"]))
})

Here's a fix based on the new zipcodeR package. Credit goes to Dave2e.

The OP asks for the "most efficient" method, so given that

  • geosphere is quite slow when you want to use it on lots of data
  • apply is essentially a looping function and can often be beaten using vectorised code

I propose a fully vectorised solution using data.table and geodist.


# the zip code columns need to be character vectors, or else the leading zeros will be dropped, causing errors
df <- data.frame("ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008), 
                 "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, "07105", 94128), 
                 stringsAsFactors = FALSE)


library(zipcodeR)
library(data.table)
library(geodist)

## Convert the zip codes to data.table so we can join on them
## I'm using the centroid of the zipcodes (lng and lat).
## If you want the distance to the edge of the zipcode boundary you'll
## need to convert this into a spatial data set
dt_zips <- as.data.table( zip_code_db[, c("zipcode", "lng", "lat")])

## convert the input data.frame into a data.table
setDT( df )

## the postcodes need to be characters
df[
  , `:=`(
    ZIP_START = as.character( ZIP_START )
    , ZIP_END = as.character( ZIP_END )
  )
]

## Attach origin lon & lat using a join
df[
  dt_zips
  , on = .(ZIP_START = zipcode)
  , `:=`(
    lng_start = lng
    , lat_start = lat
  )
]

## Attach destination lon & lat using a join
df[
  dt_zips
  , on = .(ZIP_END = zipcode)
  , `:=`(
    lng_end = lng
    , lat_end = lat
  )
]

## calculate the distance
df[
  , distance_metres := geodist::geodist_vec(
    x1 = lng_start
    , y1 = lat_start
    , x2 = lng_end
    , y2 = lat_end
    , paired = TRUE
    , measure = "haversine"
  )
]

## et voila - note the missing coordinates for zipcodes 60666 and 73344
df

#    ZIP_START ZIP_END lng_start lat_start lng_end lat_end distance_metres
# 1:     95051   98053   -121.98     37.35 -122.02   47.66      1147708.60
# 2:     94534   94128   -122.10     38.20 -122.38   37.62        69090.01
# 3:     60193   60666    -88.09     42.01      NA      NA              NA
# 4:     94591   73344   -122.20     38.12      NA      NA              NA
# 5:     94128   94128   -122.38     37.62 -122.38   37.62            0.00
# 6:     94015   73344   -122.48     37.68      NA      NA              NA
# 7:     94553   94128   -122.10     38.00 -122.38   37.62        48947.02
# 8:     10994   07105    -73.97     41.10  -74.15   40.72        44930.17
# 9:     95008   94128   -121.94     37.28 -122.38   37.62        54263.61

Also note the returned distance is given in metres.
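If you would rather not add package dependencies, the same vectorised idea can be sketched in base R: look the coordinates up with match() instead of a join, and compute the haversine distance on whole columns at once. The tiny `zips` table below is a stand-in for zip_code_db, using the rounded coordinates printed above, and `haversine_m` is a hypothetical helper, not part of any package. It assumes a mean earth radius of 6371008.8 m, so its results will differ slightly from what geodist returns.

```r
# Stand-in lookup table (assumption: rounded centroids taken from the output above)
zips <- data.frame(
  zipcode = c("95051", "98053", "94128"),
  lng = c(-121.98, -122.02, -122.38),
  lat = c(37.35, 47.66, 37.62),
  stringsAsFactors = FALSE
)

trips <- data.frame(
  ZIP_START = c("95051", "94128", "95051"),
  ZIP_END = c("98053", "94128", "00000"),  # "00000" is deliberately not in the lookup
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorised haversine distance in metres
haversine_m <- function(lng1, lat1, lng2, lat2, radius_m = 6371008.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# match() returns NA for unknown zip codes, which propagates to an NA distance
i_start <- match(trips$ZIP_START, zips$zipcode)
i_end <- match(trips$ZIP_END, zips$zipcode)

trips$distance_metres <- haversine_m(
  zips$lng[i_start], zips$lat[i_start],
  zips$lng[i_end], zips$lat[i_end]
)
trips
```

Unmatched zip codes come back as NA rather than raising an error, which mirrors the behaviour of the join above.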
