简体   繁体   中英

Pairwise distance between collections of points with latitude and longitude

I have two sets of points with their latitudes and longitudes, and I want to calculate the pairwise distance between them. This works when the two lists are small:

from geopy.distance import distance

c1 = [(-34.7102, -58.3853),
     (-32.9406, -60.7136),
     (-34.6001, -58.3729),
     (-38.9412, -67.9948),
     (-35.1871, -59.0968)]

c2 = [(-43.2568, -65.2853),
     (-31.4038, -64.1645),
     (-34.7634, -58.2120),
     (-34.4819, -58.5828),
     (-34.5669, -58.4515),
     (-34.6356, -68.369),
     (-34.4048, -58.6896)]

distances = []
for c in c1:
    this_row = [distance(c, x).meters for x in c2]
    distances.append(this_row)

However, the actual lengths of c1 and c2 are 50000 and 15000, respectively. When I run the above script with my real data, it takes forever. I'm looking for something efficient, such as

distances = scipy.spatial.distance.cdist(c1, c2)

This is very fast, but the function returns the results in a unit that is not specified as far as I know. I'm looking for results in meters.

Is there any way to rewrite the first script in a more efficient way?

I considered some options. Here's what I learnt, hope this helps:

scipy.distance.cdist :
It seems to accept a callable as metric parameter, but I think a custom function will get things slow as well.

scikitlearn.neighbors.DistanceMetric :
It has a builtin haversine metric.
Anyway, I didn't manage to well understand how to get things working, but I'm sure you'll find a way. Moreover, they claim that, for many metrics, DistanceMetric.pairwise will be slower than scipy.cdist .

Projection:
The only acceptable solution I found implies a projection like aeqd of your coordinates on a 2D plane (I'm going to use pyproj for this).
This allows you to use scipy.cdist on projected points and get things faster, but it will get less precise on pairs too far from a lat_0, lon_0 coordinate used as a reference for aeqd projection (maybe a different projection, or some workaround can solve this).
I posted results from your loop and projection for comparison.

Code:

import numpy as np
import pyproj
import scipy
from geopy.distance import distance

c1 = np.array(
    [(-34.7102, -58.3853),
     (-32.9406, -60.7136),
     (-34.6001, -58.3729),
     (-38.9412, -67.9948),
     (-35.1871, -59.0968)]
    )

c2 = np.array(
    [(-43.2568, -65.2853),
     (-31.4038, -64.1645),
     (-34.7634, -58.2120),
     (-34.4819, -58.5828),
     (-34.5669, -58.4515),
     (-34.6356, -68.369),
     (-34.4048, -58.6896)]
)

# create projections, using a mean (lat, lon) for aeqd
lat_0, lon_0 = np.mean(np.append(c1[:,0], c2[:,0])), np.mean(np.append(c1[:,1], c2[:,1]))
proj = pyproj.Proj(proj='aeqd', lat_0=lat_0, lon_0=lon_0, x_0=lon_0, y_0=lat_0)
WGS84 = pyproj.Proj(init='epsg:4326')

# transform coordinates
projected_c1 = pyproj.transform(WGS84, proj, c1[:,1], c1[:,0])
projected_c2 = pyproj.transform(WGS84, proj, c2[:,1], c2[:,0])
projected_c1 = np.column_stack(projected_c1)
projected_c2 = np.column_stack(projected_c2)

# calculate pairwise distances in km with both methods
sc_dist = scipy.spatial.distance.cdist(projected_c1, projected_c2)
geo_distances = []
for c in c1:
    this_row = [distance(c, x).km for x in c2]
    geo_distances.append(this_row)

print("scipy\n")
print(sc_dist/1000)
print("\n")
print("geopy\n")
print(np.array(geo_distances))

Output:

scipy

[[1120.68384362  652.43817992   16.93436992   31.1480337    17.02161533
   914.68158465   43.91751967]
 [1212.75267066  367.46344647  307.41739698  261.2734859   276.57111944
   733.44881488  248.25303017]
 [1131.82744423  646.91757042   23.36452322   23.31086804    8.09877062
   916.39849619   36.27486327]
 [ 531.58906215  906.44775882  987.23837525  974.96389103  979.98229079
   479.75111318  971.51078808]
 [1042.57374645  631.42752409   93.47695658   91.28419725   90.64134205
   849.25121659   94.46063802]]


geopy

[[1120.50400287  652.32406273   16.93254254   31.1392657    17.01619952
   914.66757909   43.9058496 ]
 [1212.7494454   367.3591636   307.3468806   261.21313155  276.50708156
   733.28119124  248.19563872]
 [1131.65345927  646.79571942   23.35783766   23.30613446    8.09745879
   916.38027748   36.26700778]
 [ 530.49964531  905.85826336  987.20594883  974.95078113  979.96382386
   478.97343089  971.50158032]
 [1042.44765568  631.37206038   93.47402012   91.2737422    90.63359193
   849.24940173   94.44779778]]

cdist supports custom distance function, you can pass it like this:

from scipy.spatial.distance import cdist
from geopy.distance import distance as geodist # avoid naming confusion

sc_dist = cdist(c1, c2, lambda u, v: geodist(u, v).meters) # you can choose unit here

I'm not sure about the performance though.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM