I have two dataframes, both of which contain columns of latitude and longitude. For each lat/lon entry in the first dataframe, I want to evaluate each lat/lon pair in the second dataframe to determine distance.
For example:
df1:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21

df2:
     lat     lon
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...
I'm not sure how to go about doing this.
Thanks for the help!
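Euclidean distance between two points (lat1, lon1) and (lat2, lon2) is calculated as sqrt((lat1 - lat2)^2 + (lon1 - lon2)^2).
You can compute this row by row (matching rows by index) for your two dataframes like this: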
((df1 - df2) ** 2).sum(1) ** .5
0 2.714001
1 9.253113
2 4.232363
dtype: float64
UPDATE: as noted by @root, the Euclidean metric doesn't make much sense for latitude/longitude data, so let's use sklearn.neighbors.DistanceMetric instead (in newer scikit-learn releases the class lives in sklearn.metrics):
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
First we can build a DataFrame with all combinations (credit: @root):
import pandas as pd
x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
    .drop('k', axis=1)
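Vectorized haversine distance calculation (this needs import numpy as np; the metric expects [lat, lon] in radians and returns distances on a unit sphere, so the result is multiplied by an approximate Earth radius of 6367 km):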
x['dist'] = np.ravel(dist.pairwise(np.radians(df1),np.radians(df2)) * 6367)
Result:
In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594
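The k=1 merge above is a workaround for building all combinations. As a minimal sketch, assuming pandas 1.2 or later (where merge accepts how='cross'), the same frame can be built without the helper column. The frames below just reproduce the sample data from the question.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# Cross join: every row of df1 paired with every row of df2.
# Assumes pandas >= 1.2, which added how='cross'.
cross = df1.merge(df2, how='cross', suffixes=('1', '2'))

# The helper-column version from the answer, for comparison.
keyed = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k',
                 suffixes=('1', '2')).drop('k', axis=1)
```

Both frames have the columns lat1, lon1, lat2, lon2 and one row per (df1, df2) pair, ordered with df1 varying slowest, which is the order np.ravel produces from the pairwise matrix.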
OLD answer:
If I understand correctly, you can use the pairwise distance function scipy.spatial.distance.pdist:
In [32]: from scipy.spatial.distance import pdist
In [43]: from itertools import combinations
In [34]: X = pd.concat([df1, df2])
In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01
As a pandas Series:
In [36]: s = pd.Series(pdist(X),
index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))
In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64
As a pandas DataFrame:
In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831
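Note that pdist on the concatenated frame also returns distances between rows of the same dataframe. If only df1-to-df2 pairs are wanted, a possible alternative (not part of the original answer) is scipy.spatial.distance.cdist, sketched below on the sample data from the question. Like pdist, it defaults to Euclidean distance in degrees, so the same caveat about the metric applies.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# Rows index df1, columns index df2; entry [i, j] is the Euclidean
# distance (in degrees) between df1 row i and df2 row j.
D = pd.DataFrame(cdist(df1.to_numpy(), df2.to_numpy()),
                 index=df1.index, columns=df2.index)

# Long format: one row per (df1, df2) pair.
pairs = (D.rename_axis(index='df1', columns='df2')
           .stack()
           .reset_index(name='dist'))
```

Here pairs has nine rows, one for each combination of a df1 row and a df2 row, and no within-frame pairs.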
You can perform a cross join to get all combinations of lat/lon, then compute the distance using an appropriate measure. To do so, you can use the geopy package, which supplies geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances, with vincenty giving more accurate results but being computationally slower.
import pandas as pd
# Note: vincenty was removed in geopy 2.0; geopy.distance.geodesic replaces it there.
from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])
    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
    .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)
I used kilometers as the unit in the example, but others can be specified, e.g.:
vincenty(latlon1, latlon2).miles
The resulting output:
    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874
Edit
As noted by @MaxU in the comments, you can use a numpy implementation of the Haversine formula in a similar manner for extra performance. This should be equivalent to the great_circle function in geopy.
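A minimal sketch of such a numpy implementation follows. The function name haversine_km and its radius_km parameter are hypothetical, chosen for illustration, and the default radius of 6371 km is an assumed mean Earth radius. The cross join uses how='cross', which assumes pandas 1.2 or later; the k=1 merge shown earlier works the same way.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees.

    Hypothetical helper; radius_km is an assumed mean Earth radius.
    Accepts scalars, numpy arrays, or pandas Series.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Sample data copied from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# All combinations of df1 and df2 rows (assumes pandas >= 1.2).
pairs = df1.merge(df2, how='cross', suffixes=('1', '2'))

# One vectorized call over whole columns instead of a row-wise apply.
pairs['distance'] = haversine_km(pairs['lat1'], pairs['lon1'],
                                 pairs['lat2'], pairs['lon2'])
```

Because the function operates on whole columns at once, it avoids the per-row Python call that dist.apply makes. The results depend slightly on the radius chosen, so they will differ a little from the vincenty figures above.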