
Match two datasets by minimum geospatial distance (R)

I have the following two datasets:

library(data.table)

houses <- data.table(house_number = c(1:3),
                     lat_decimal = seq(1.1, 1.3, by = 0.1),
                     lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
                       lat_decimal = seq(1, 2, by = 0.1),
                       lon_decimal = seq(2, 3, by = 0.1))

I want to merge houses and stations such that the resulting station_number is the station that is closest to the corresponding house_number.

This question is very similar, but I'm not sure whether it works with latitude and longitude. I also don't know how to calculate distances from longitude and latitude myself, which is why I would prefer to simply use distm from the geosphere package.

I have never worked with the outer function. If the answer to the aforementioned question would work here, how can I adapt it to use the distm function instead of the sqrt function?

Your question is a bit more complicated than a simple merge, and outer is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:

library(tidyverse)
library(geosphere)

First, a bit of prep. distm expects coordinates ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo (station_numer) while we're at it:

houses <- data.frame(house_number = c(1:3),
                     lon_house = seq(1.4, 1.6, by = 0.1),
                     lat_house = seq(1.1, 1.3, by = 0.1)
                     )
stations <- data.frame(station_number = c(1:11),
                       lon_station = seq(2, 3, by = 0.1),
                       lat_station = seq(1, 2, by = 0.1)
                       )

We'll create "nested" data frames so that it's easier to keep each pair of coordinates together:

# tidyr >= 1.0 syntax; older versions used nest(houses, -house_number, .key = 'house_coords')
house_nest <- nest(houses, house_coords = -house_number)
station_nest <- nest(stations, station_coords = -station_number)

  house_number house_coords        
         <int> <list>              
1            1 <data.frame [1 × 2]>
2            2 <data.frame [1 × 2]>
3            3 <data.frame [1 × 2]>

   station_number station_coords      
            <int> <list>              
 1              1 <data.frame [1 × 2]>
 2              2 <data.frame [1 × 2]>
 3              3 <data.frame [1 × 2]>
 4              4 <data.frame [1 × 2]>
 5              5 <data.frame [1 × 2]>
 6              6 <data.frame [1 × 2]>
 7              7 <data.frame [1 × 2]>
 8              8 <data.frame [1 × 2]>
 9              9 <data.frame [1 × 2]>
10             10 <data.frame [1 × 2]>
11             11 <data.frame [1 × 2]>

Use tidyr::crossing to pair every row of one data frame with every row of the other:

data.master <- crossing(house_nest, station_nest)

   house_number house_coords         station_number station_coords      
          <int> <list>                        <int> <list>              
 1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]>
 2            1 <data.frame [1 × 2]>              2 <data.frame [1 × 2]>
 3            1 <data.frame [1 × 2]>              3 <data.frame [1 × 2]>
 4            1 <data.frame [1 × 2]>              4 <data.frame [1 × 2]>
 5            1 <data.frame [1 × 2]>              5 <data.frame [1 × 2]>
 6            1 <data.frame [1 × 2]>              6 <data.frame [1 × 2]>
 7            1 <data.frame [1 × 2]>              7 <data.frame [1 × 2]>
 8            1 <data.frame [1 × 2]>              8 <data.frame [1 × 2]>
 9            1 <data.frame [1 × 2]>              9 <data.frame [1 × 2]>
10            1 <data.frame [1 × 2]>             10 <data.frame [1 × 2]>
# ... with 23 more rows

With all this in place, we can use distm on each row to calculate a distance (in metres), and keep only the shortest distance per house:

data.dist <- data.master %>% 
  mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>% 
  group_by(house_number) %>% 
  filter(dist == min(dist))

  house_number house_coords         station_number station_coords         dist
         <int> <list>                        <int> <list>                <dbl>
1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 67690.
2            2 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 59883.
3            3 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 55519.
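For comparison, here is a minimal base R sketch of the outer-based route the question asks about. It assumes a hand-written haversine function (haversine_km is a hypothetical helper, and the 6371 km Earth radius is an assumed mean value) rather than distm, so the distances will differ slightly from the ones above and are in kilometres, not metres. As far as I know, geosphere::distm called on the two lon/lat column pairs returns the same kind of houses-by-stations matrix, which could be used in place of dist_km.

```r
# Haversine great-circle distance in kilometres; inputs in decimal degrees.
# The Earth radius of 6371 km is an assumed mean value.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

houses <- data.frame(house_number = 1:3,
                     lon_house = seq(1.4, 1.6, by = 0.1),
                     lat_house = seq(1.1, 1.3, by = 0.1))
stations <- data.frame(station_number = 1:11,
                       lon_station = seq(2, 3, by = 0.1),
                       lat_station = seq(1, 2, by = 0.1))

# outer() passes two equally long index vectors covering every house/station
# pair, so the function only needs to be vectorised over them.
dist_km <- outer(
  seq_len(nrow(houses)), seq_len(nrow(stations)),
  function(i, j) haversine_km(houses$lat_house[i], houses$lon_house[i],
                              stations$lat_station[j], stations$lon_station[j])
)

# Column position of the smallest distance in each row
# (which.min keeps the first one if there are ties).
nearest <- apply(dist_km, 1, which.min)

result <- data.frame(
  house_number = houses$house_number,
  station_number = stations$station_number[nearest],
  dist_km = dist_km[cbind(seq_len(nrow(houses)), nearest)]
)
print(result)
```

The full matrix holds one entry per house/station pair, so this is fine for small inputs but grows quickly with larger ones.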

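Use match_nrst_haversine from hutilscpp. This works on the original data.table objects from the question, so it keeps the lat_decimal/lon_decimal columns and the station_numer spelling: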

library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
                                                             lon_decimal,
                                                             addresses_lat = stations$lat_decimal,
                                                             addresses_lon = stations$lon_decimal,
                                                             Index = stations$station_numer,
                                                             close_enough = 0,
                                                             cartesian_R = 5)]

houses
#>    house_number lat_decimal lon_decimal station_number     dist
#> 1:            1         1.1         1.4              1 67.62617
#> 2:            2         1.2         1.5              1 59.87076
#> 3:            3         1.3         1.6              1 55.59026

You may want to adjust close_enough and cartesian_R for performance if your data are numerous (i.e. over a million points to match).

 `cartesian_R` 

The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.

 `close_enough` 

The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.)

For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later.

May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.
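To make the close_enough trade-off concrete, here is a small base R illustration of the behaviour described above: the scan stops at the first candidate under the threshold, even if a nearer one comes later. The function first_close_enough and the distances are made up for illustration. This is not hutilscpp's implementation.

```r
# Hypothetical helper illustrating the documented "close enough" behaviour;
# it is NOT how hutilscpp is implemented.
first_close_enough <- function(dists, close_enough = 0) {
  stopifnot(length(dists) > 0)
  best <- 1L
  for (k in seq_along(dists)) {
    # Track the nearest candidate seen so far.
    if (dists[k] < dists[best]) best <- k
    # Stop early at the first candidate below the threshold.
    if (dists[k] < close_enough) return(k)
  }
  best
}

candidate_dists_m <- c(50, 8, 3, 20)  # made-up distances in metres

# Stops at position 2 (8 m), although position 3 (3 m) is nearer.
early <- first_close_enough(candidate_dists_m, close_enough = 10)

# With a threshold of 0 nothing qualifies early, so the whole vector is
# scanned and the true nearest candidate (position 3) is returned.
exact <- first_close_enough(candidate_dists_m, close_enough = 0)

print(c(early = early, exact = exact))
```

This is why the answer above sets close_enough = 0: it trades speed for the exact nearest station.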

