How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table

I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:

1) How to efficiently calculate distance between pair of coordinates using data.table :=

2) Using a simple for loop on spatial data

3) Calculate distance between two latitude-longitude points? (Haversine formula; a minimal R sketch of this formula follows this list)

4) Fastest way to determine COUNTRY from millions of GPS coordinates [R]
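
For reference, this is a minimal R sketch of the haversine formula referenced in link 3 (my own illustration, not code from those pages; it assumes a spherical Earth with radius 6371 km):

# vectorized haversine distance in km; assumes a spherical Earth, R = 6371 km
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  6371 * 2 * asin(pmin(1, sqrt(a)))
}

# e.g. Amsterdam to Utrecht, roughly 35 km
haversine_km(52.37, 4.90, 52.09, 5.12)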

I was able to construct code that works, but it is (too) slow. I was wondering if someone can help me with a faster data.table() implementation, or can point out where the bottleneck in my code is? Is it the spDistsN1() function, or maybe the combination of apply() and melt()? I am most comfortable in R, but open to other software (as long as it is open source).

Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.

# packages:
library(data.table)
library(tidyverse)
library(sp)
library(reshape2) # melt() on a plain list (used below) comes from reshape2


# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata<-as.data.frame(cbind(id=1:number_of_GPS_coordinates, 
                             lat=runif(number_of_GPS_coordinates,50.5,53.5), 
                             lon=runif(number_of_GPS_coordinates,4,7)))

# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))

names(busdata.data) <- c("latitude_bustops",  "longitude_bustops", "name")

Download the real bus stop data if you want; it is kind of hard to reproduce a random sample of this.

#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#gzfile(temp, 'rt')
#busstopdata <- read.csv(temp, stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
#  mutate(latitude_bustops = latitude)%>%
#  mutate(longitude_bustops = longitude)%>%
#  dplyr::select(name, latitude_bustops,  longitude_bustops)

The code I use now to calculate distances. It works, but it is pretty slow:

countDataPoints3 <- function(p) {
  distances <- spDistsN1(data.matrix(gpsdata[,c("lon","lat")]), 
                         p,
                         longlat=TRUE) # in km
  return(which(distances <= .2)) # distance is now set to 200 meters
}


# code to check per data point if a bus stop is near and save this per bus stop in a list entry
datapoints.by.bustation       <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)


# rename list entries
names(datapoints.by.bustation) <- busdata.data$name

# melt list into one big data.frame
long.data.frame.busstops       <- melt(datapoints.by.bustation)

# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := 1:nrow(gpsdata)]
setkey(gpsdata, rowID)
setDT(long.data.frame.busstops)

# merge the data, and filter non-unique entries 
setkey(long.data.frame.busstops, value)
GPS.joined        <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by="id") # make sure every GPS point appears only once

# this last part of the code is needed to make sure that if there is more than one bus stop nearby, these bus stops are put in a list
# instead of adding rows and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.

# makes sense:
as_tibble(GPS.joined.unique2) %>%
  summarize(sum = sum(nearby)) 
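
To locate the bottleneck the question asks about, one quick option (my own sketch, not part of the original post) is to time the candidate steps in isolation with system.time(); on this example the apply() over spDistsN1() dominates, while melt() on the resulting list is comparatively cheap:

# time the distance computation in isolation (illustrative)
system.time(
  apply(data.matrix(busdata.data[, c("longitude_bustops", "latitude_bustops")]),
        1, countDataPoints3)
)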

Consider cutting using a slicing method: first cut by close latitudes and close longitudes, in this case 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support for rolling joins.

The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.

library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)

gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)

setkey(gpsdata, lat)

# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]

busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]


setkey(busdata.data, lat)

gpsID_by_lat <- 
  gpsdata[, .(id), keyby = "lat"]


By_latitude <- 
  busdata.data[gpsdata, 
               on = "lat",

               # within 0.5 degrees of latitude
               roll = 0.5, 
               # +/-
               rollends = c(TRUE, TRUE),

               # and remove those beyond 0.5 degrees
               nomatch=0L] %>%
  .[, .(id_lat = id,
        name_lat = name,
        bus_lat = latitude_bustops,
        bus_lon = longitude_bustops,
        gps_lat,
        gps_lon),
    keyby = .(lon = gps_lon)]

setkey(busdata.data, lon)

By_latlon <-
  busdata.data[By_latitude,
               on = c("name==name_lat", "lon"),

               # within 0.5 degrees of longitude
               roll = 0.5, 
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch=0L]

By_latlon[, distance := haversine_distance(lat1 = gps_lat, 
                                           lon1 = gps_lon,
                                           lat2 = bus_lat,
                                           lon2 = bus_lon)]

By_latlon[distance < 0.2]
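
A possible follow-up (my sketch, not part of the original answer): collapse the matches to a single nearby flag per GPS point, using the id_lat column that survives both joins and the untouched gps_orig copy made above:

# flag each original GPS point that has at least one bus stop within 200 m
nearby_ids <- unique(By_latlon[distance < 0.2, id_lat])
gps_orig[, nearby := id %in% nearby_ids]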

This is the function I came up with so far. @Dave2e, thanks. It is already an awful lot faster than what I had. There is clearly still room for a lot of improvement, but as it stands it is fast enough for my analysis. I only slice by latitude, not by longitude. The only reason for that is that it makes indexing, and then looping over indices, really easy, but more speed could be gained by also indexing by longitude. Also, in real GPS data there tend to be many duplicate values (same lon/lat, different timestamp); the code would be more efficient if it took this into account (see the sketch after the function). Maybe I will work on that in the future.

# this approach could be much faster if it filtered out duplicate GPS coordinates first

check_if_close <- function(dataset1     = GPS.Utrecht.to.Gouda,       
                           dataset2     = bus_stops,     
                           n.splits     = 500,
                           desired.dist = .2){

# dataset1 needs at least the columns 
#  - "id", 
#  - "device_id"
#  - "latitude"
#  - "longitude"

# dataset2 needs at least the columns 
#  - "id", 
#  - "name"
#  - "latitude"
#  - "longitude"

# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means that the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# if we look at the Netherlands as a whole this would mean we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look only at a small subsection (because we are only looking at a single trip, for example) we need far fewer slices.

# 1) we only select the variables we need from dataset 1
  dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
  setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))

# 2) we only select the variables we need from dataset 2
  dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
  setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))

# 3) only keep the subset of dataset2 that falls within dataset1.
#    There is no reason to check if features are close that already fall outside the GPS coordinates in the trip we want to check.
#    We do add a 0.01 point margin around it to be on the safe side. Maybe a feature falls just outside the GPS coordinates,
#    but is still near a GPS point
  dataset2 <- dataset2[latitude_feature  %between%  (range(dataset1$latitude_gps) + c(-0.01, +0.01)) 
                       & longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]

# 4) we cut dataset2 into slices on the latitude dimension
#    some trial and error is involved in getting the right amount: if you add too many you get a large and redundant amount of empty values;
#    if you add too few you need to check too many GPS-to-feature distances per slice



dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]

# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))


# 6) now we assign to which slice every GPS coordinate in dataset1 belongs
#    this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]

# 7) this is the actual function we use to check if a data point is nearby.
#    potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist=desired.dist) {
  distances <- spDistsN1(data.matrix(filter(dataset1,latitude_range==h)[,c("longitude_gps","latitude_gps")]), 
                         p,
                         longlat=TRUE) # in km
  return(which(distances <= dist)) # distance is now set to 200 meters
}
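# note (editor's aside, not in the original): dplyr::filter() runs inside the loop below;
# a plain data.table subset such as dataset1[latitude_range == h] would likely be faster here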

# 8) we assign an ID to dataset1, restarting at every slice.
#    we need this later to match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]

# 9) here we loop over all the slices, and for every point check if there is a feature nearby in the slice it falls in.
#    to be on the safe side we also check the slices left and right of it, just to make sure we do not miss features that
#    are nearby but just fall in a different slice.
#         9a: create an empty list we fill with dataframes later 
TT<-vector("list", length=n.splits)
#         9b: loop over the number of slices using above defined function

for(i in 1:n.splits){
  datapoints.near.feature<-apply(data.matrix(dataset2[range2 %in% c(i-1,i, i+1), c("longitude_feature","latitude_feature")]), 1, checkdatapoint, h=i)
#         9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we create an empty list input
    if (is.integer(datapoints.near.feature) || is.matrix(datapoints.near.feature)) { # class()=="matrix" breaks on R >= 4.0
    TT[[i]] <-NULL
  } else {
#         9d: if there was a match we get a list of data point that are named
    names(datapoints.near.feature)    <- dataset2[range2 %in% c(i-1,i, i+1), name]
#         9e: then we 'melt' this list into  data.frame
    temp <- melt(datapoints.near.feature)
#         9f: then we transform it into a data.table and change the names
    setDT(temp)
    setnames(temp, old=c("value", "L1"), new= c("value", "feature_name"))
#         9h: then we only select the data points in dataset1 that fall in the current slice, give them an
#             ID and merge them with the file of nearby bus stops
    gpsdata.f <- dataset1[latitude_range==i, ]
    gpsdata.f[, rowID2 := seq_len(.N)]
    setkey(gpsdata.f, rowID2)
    setkey(temp, value)
    GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x= "rowID2", by.y= "value", all.x=TRUE)
#         9i: we only keep the unique entries and for every slice save them to the list
    GPS.joined.unique.temp <- unique(GPS.joined.temp, by=c("id_dataset1", "feature_name"))
    TT[[i]] <-  GPS.joined.unique.temp 
    cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()


    #cat(i/n.splits*100, " \r"); flush.console()

  }

}

# 10) now we left join the original dataset and the data points that are near a feature
finallist<- merge(x = dataset1, 
                  y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]), 
                  by.x= "id_dataset1", 
                  by.y= "id_dataset1", 
                  all.x=TRUE)

# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.

# 12) if a point is near multiple features at once these are listed in a vector,
#     instead of having duplicate rows with the same id but different features
finallist <- unique(
  setDT(finallist)[order(id_dataset1, feature_name),
                   list(feature_name = list(feature_name),
                        id = id_dataset1,
                        lat = latitude_gps.x,
                        lon = longitude_gps.x,
                        nearby = nearby),
                   by = id_dataset1],
  by = "id_dataset1")

return(finallist)
}
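
A sketch of the duplicate-coordinate idea mentioned above (my illustration, not part of the function): run the expensive nearby-check once per unique coordinate pair, then join the flags back onto all timestamps.

# collapse to unique coordinate pairs before the expensive check (illustrative)
setDT(gpsdata)
unique_coords <- unique(gpsdata, by = c("lat", "lon"))
# run check_if_close() (or any nearby-check) on unique_coords instead of gpsdata,
# then merge the resulting nearby flag back with on = c("lat", "lon")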
