How to calculate distance between 2 coordinates below a certain threshold in R?

I have 44,000 US zip codes and their corresponding centroid lat/long in R, from the 'zipcode' package. I need to calculate the distance between each pair of zip codes and keep those distances that are less than 5 miles. The problem is that computing all pairwise distances would require a 44,000 x 44,000 matrix, which I can't create due to space issues.
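For scale: a dense 44,000 x 44,000 matrix of double-precision distances would need about 14 GiB of RAM, which is the space problem in concrete terms. A quick back-of-the-envelope check (a sketch added for illustration, not part of the original post):

n <- 44000
n * n * 8 / 1024^3  # 8 bytes per double; about 14.4 GiB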

I checked through the posts on R; the closest to my requirement is one that outputs the minimum distance between two lat/long datasets:

library(geosphere)  # distGeo()
library(dplyr)      # bind_rows(); the post used the now-defunct rbind_all()

DB1 <- data.frame(location_id=1:7000,LATITUDE=runif(7000,min = -90,max = 90),LONGITUDE=runif(7000,min = -180,max = 180))
DB2 <- data.frame(location_id=7001:12000,LATITUDE=runif(5000,min = -90,max = 90),LONGITUDE=runif(5000,min = -180,max = 180))

# For one DB1 point, find the nearest DB2 point and the geodesic distance (meters) to it
DistFun <- function(ID){
  TMP <- DB1[DB1$location_id==ID,]
  TMP1 <- distGeo(TMP[,3:2],DB2[,3:2])  # columns 3:2 = (LONGITUDE, LATITUDE)
  TMP2 <- data.frame(DB1ID=ID,DB2ID=DB2[which.min(TMP1),1],DistanceBetween=min(TMP1))
  print(ID)
  return(TMP2)
}

DistanceMatrix <- bind_rows(lapply(DB1$location_id, DistFun))

Even if we could modify the above code to return all distances <= 5 miles (for example), it is extremely slow in execution.

Is there an efficient way to arrive at all zip code combinations whose centroids are <= 5 miles from each other?

Generating the whole distance matrix at once would be very RAM-consuming, and looping over each combination of unique zip codes very time-consuming. Let's find a compromise.

I suggest chunking the zipcode data.frame into pieces of (for example) 100 rows (with the help of the chunk function from the bit package), then calculating the distances between all 44,336 points and those 100 points, filtering by the target distance threshold, and then moving on to the next chunk. In my example I convert the zipcode data into a data.table to gain some speed and save RAM.

library(zipcode)
library(data.table)
library(magrittr)
library(geosphere)

data(zipcode)

setDT(zipcode)
zipcode[, dum := NA] # we'll need it for full outer join

Just for information, that's the approximate size of each piece of data in RAM:

merge(zipcode, zipcode[1:100], by = "dum", allow.cartesian = T) %>% 
  object.size() %>% print(unit = "Mb")
# 358.2 Mb

The code itself:

lapply(bit::chunk(1, nrow(zipcode), 1e2), function(ridx) {
  merge(zipcode, zipcode[ridx[1]:ridx[2]], by = "dum", allow.cartesian = T)[
    , dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                      matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34 # meters to miles
      ][dist <= 5 # necessary distance threshold
      ][, dum := NULL]
  }) %>% rbindlist -> zip_nearby_dt

zip_nearby_dt # not the whole result: first 10 chunks only

       zip.x          city.x state.x latitude.x longitude.x zip.y     city.y state.y latitude.y longitude.y     dist
    1: 00210      Portsmouth      NH   43.00590   -71.01320 00210 Portsmouth      NH   43.00590   -71.01320 0.000000
    2: 00210      Portsmouth      NH   43.00590   -71.01320 00211 Portsmouth      NH   43.00590   -71.01320 0.000000
    3: 00210      Portsmouth      NH   43.00590   -71.01320 00212 Portsmouth      NH   43.00590   -71.01320 0.000000
    4: 00210      Portsmouth      NH   43.00590   -71.01320 00213 Portsmouth      NH   43.00590   -71.01320 0.000000
    5: 00210      Portsmouth      NH   43.00590   -71.01320 00214 Portsmouth      NH   43.00590   -71.01320 0.000000
---                                                                                                              
15252: 02906      Providence      RI   41.83635   -71.39427 02771    Seekonk      MA   41.84345   -71.32343 3.688747
15253: 02912      Providence      RI   41.82674   -71.39770 02771    Seekonk      MA   41.84345   -71.32343 4.003095
15254: 02914 East Providence      RI   41.81240   -71.36834 02771    Seekonk      MA   41.84345   -71.32343 3.156966
15255: 02916         Rumford      RI   41.84325   -71.35391 02769   Rehoboth      MA   41.83507   -71.26115 4.820599
15256: 02916         Rumford      RI   41.84325   -71.35391 02771    Seekonk      MA   41.84345   -71.32343 1.573050

On my machine it took 1.7 minutes to process 10 chunks, so the whole run may take 70-80 minutes; not fast, but it may be satisfactory. We can increase the chunk size to 200 or 300 rows depending on the available RAM, which will shorten the processing time by a factor of 2 or 3 respectively.
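Changing the chunk size only means changing the last argument of the bit::chunk call used above, e.g.:

# 300-row chunks instead of 100-row chunks: fewer, but larger, cartesian merges
bit::chunk(1, nrow(zipcode), 3e2)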

The drawback of this solution is that the resulting data.table contains "duplicated" rows: it holds both the distance from point A to point B and the distance from B to A. This may need some additional filtering.
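One simple way to do that filtering (a sketch, assuming the column names shown in the output above) is to keep only the lexicographically ordered pairs, which also drops the zero-distance self-pairs:

zip_unique_pairs <- zip_nearby_dt[zip.x < zip.y]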

I guess the most efficient algorithms would first translate the spatial locations into a tree-like data structure. You don't need to do this explicitly, though: if you have an algorithm that can 1) bin lat/longs to a spatial index and 2) tell you the neighbors of that index, then you can use it to filter your square data. (This will be less efficient than building a tree, but probably easier to implement.)

geohash is such an algorithm. It turns continuous lat/long into 2-d bins. There is a (quite new) package providing geohash in R. Here's one idea of how you could use it for this problem.

First, with geohash, do some preliminary calibration:

  1. translate lat/long to a hash with bin precision p (say)

  2. assess whether the hash is calibrated at a precision similar to the distances you're interested in (say, 3-7 miles between neighboring centroids); if not, return to 1 and adjust the precision p

This yields a zipcode - hash value relationship.
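As a sketch of this step, using the geohashTools package (the package choice and the precision are assumptions, not something prescribed by the answer): at precision 4 the cells are roughly 39 km x 19.5 km, so every point within 5 miles (about 8 km) of a centroid lies either in the centroid's cell or in one of its 8 neighbors.

library(geohashTools)
library(zipcode)

data(zipcode)

# One hash per zip centroid, at bin precision p = 4
zipcode$hash <- gh_encode(zipcode$latitude, zipcode$longitude, precision = 4L)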

Then, compute distances for each (unique) hash value:

  1. determine its nearest neighbors (8 of them, because hashes form a 2-d grid) and so select 9 hash values

  2. calculate pairwise distances among all zips within the 9 hashes (using, e.g., distGeo as in the question)

  3. return all zip-zip pairwise distances for the hash value (e.g., in a matrix)

This yields a hash value - zip-zip distance object relationship.

(In step 2 it'd clearly be optimal to only calculate each nearest-neighbor pair once. But this might not be necessary.)

Finally, for each zip:

  1. use the above two steps (with the hash value as key) to get the zip-zip distance object for the focal zip
  2. filter the object to the distances from the focal zip (recall, it holds all pairwise distances within the set of hashes adjacent to that of the focal zip)
  3. only keep distances < 5 miles

This yields a zip - zips-within-5-miles object. (The zips within 5 miles of each focal zip could be stored as a list-column in a data frame next to a column of focal zips, or as a separate list with focal zips as names.)
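A minimal per-zip sketch of the remaining steps under the same assumptions (gh_neighbors from geohashTools for the neighbor lookup, distGeo from geosphere for the distances); for simplicity it queries one focal zip at a time rather than batching by hash value as described above:

library(geohashTools)
library(geosphere)

hashes <- unique(zipcode$hash)
# The 3x3 block of cells around each occupied cell: itself plus its 8 neighbors
nbr <- gh_neighbors(hashes, self = TRUE)

# All zips within 5 miles of one focal zip, searching only its 3x3 hash block
zips_within_5mi <- function(zip_row) {
  block <- zipcode[zipcode$hash %in% unlist(nbr[match(zip_row$hash, hashes), ]), ]
  d <- distGeo(c(zip_row$longitude, zip_row$latitude),
               block[, c("longitude", "latitude")]) / 1609.34  # meters to miles
  block$zip[d <= 5 & block$zip != zip_row$zip]
}

zips_within_5mi(zipcode[1, ])  # e.g., zips near the first centroid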

The following is a solution using spatialrisk. The functions are written in C++ and are therefore very fast. On my machine it takes about 25 seconds.

library(zipcodeR)
library(spatialrisk)
library(dplyr)
library(tidyr)  # unnest()

# Zip code data
zipcode <- zipcodeR::zip_code_db

# Radius in meters (note: 5 miles is roughly 8047 meters)
radius_meters <- 5000

# Zip codes with non-missing coordinates
zipcode_sel <- as_tibble(zipcode) %>%
  select(zipcode, lat, lon = lng) %>%
  filter(!is.na(lat), !is.na(lon))

# For each zip code, find all zip codes within the radius;
# [-1, ] drops the nearest match, which is the focal zip code itself
sel <- zipcode_sel %>%
  mutate(zipcode_within_radius = purrr::map2(lon, lat, ~points_in_circle(zipcode_sel, .x, .y, radius = radius_meters)[-1,])) %>%
  unnest(cols = c(zipcode_within_radius), names_repair = "unique")
