Comparing columns in two separate pandas dataframes

Question

I have two dataframes, each with latitude and longitude columns. For each lat/lon entry in the first dataframe, I want to compute the distance to every lat/lon pair in the second dataframe.

For example:

df1:                     df2:

     lat     lon              lat     lon 
0   38.32  -100.50       0   37.65   -97.87
1   42.51   -97.39       1   33.31   -96.40
2   33.45  -103.21       2   36.22  -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...

I'm not sure how to go about doing this.

Thanks for the help!

Answer 1

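The Euclidean distance between two points p and q is calculated as

$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

You can compute it for your two dataframes row by row (row i of df1 against row i of df2, aligned on the index) like this: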

((df1 - df2) ** 2).sum(1) ** .5

0    2.714001
1    9.253113
2    4.232363
dtype: float64
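
Note that the expression above only pairs row i of df1 with row i of df2. If every row of df1 should be compared against every row of df2, as the question asks, one option is NumPy broadcasting. The following is a minimal sketch, assuming the example data from the question; the names `a`, `b`, `diff`, and `pairwise` are only illustrative. Keep in mind that a Euclidean distance on raw degrees is not a physical distance.

```python
import numpy as np
import pandas as pd

# Example data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45], 'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22], 'lon': [-97.87, -96.40, -100.01]})

a = df1[['lat', 'lon']].to_numpy()  # shape (3, 2)
b = df2[['lat', 'lon']].to_numpy()  # shape (3, 2)

# a[:, None, :] has shape (3, 1, 2) and b[None, :, :] has shape (1, 3, 2),
# so their difference broadcasts to shape (3, 3, 2): one row per (df1, df2) pair.
diff = a[:, None, :] - b[None, :, :]

# Sum the squared differences over the lat/lon axis and take the square root.
pairwise = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=2)),
                        index=df1.index, columns=df2.index)
```

Here `pairwise.loc[i, j]` holds the distance between row i of df1 and row j of df2, so the diagonal corresponds to the row-by-row result shown above.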

Answer 2

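UPDATE: as noted by @root, the Euclidean metric doesn't make much sense for latitude/longitude data, so let's use the haversine metric from sklearn.neighbors.DistanceMetric instead (scikit-learn 1.0 and later expose the same class as sklearn.metrics.DistanceMetric):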

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

First, we can build a DataFrame with all combinations (credit: @root):

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)
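
On pandas 1.2 or later, the same cross join can also be written with `how='cross'`, which avoids the temporary key column `k`. A minimal sketch, assuming the example data from the question:

```python
import pandas as pd

# Example data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45], 'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22], 'lon': [-97.87, -96.40, -100.01]})

# how='cross' pairs every row of df1 with every row of df2;
# the suffixes disambiguate the overlapping 'lat' and 'lon' columns.
x = df1.merge(df2, how='cross', suffixes=('1', '2'))
```

The result has one row per (df1, df2) combination, with df1 varying slowest, which is the same ordering the `k=1` merge produces.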

Then a vectorized haversine distance calculation (the inputs must be in radians; 6367 is the approximate Earth radius in km):

x['dist'] = np.ravel(dist.pairwise(np.radians(df1), np.radians(df2)) * 6367)

Result:

In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594

OLD answer:

IIUC you can use the pairwise distance function scipy.spatial.distance.pdist (Euclidean by default; note that it also returns pairs within the same dataframe):

In [32]: from scipy.spatial.distance import pdist

In [43]: from itertools import combinations

In [34]: X = pd.concat([df1, df2])

In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

As a pandas Series:

In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))

In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64

As a pandas DataFrame:

In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831

Answer 3

You can perform a cross join to get all combinations of lat/lon, then compute the distance using an appropriate measure. To do so, you can use the geopy package, which supplies geopy.distance.vincenty and geopy.distance.great_circle. Both give valid distances; vincenty is more accurate but computationally slower. (Note that geopy 2.0 removed vincenty in favor of geopy.distance.geodesic.)

from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])

    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)

I used kilometers as my units in the example, but others can be specified, e.g.:

vincenty(latlon1, latlon2).miles

The resulting output:

    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874

Edit

As noted by @MaxU in the comments, you can use a numpy implementation of the Haversine formula in a similar manner for extra performance. This should be equivalent to the great_circle function in geopy.
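
A NumPy version along those lines might look like the sketch below. It is not taken from the comments: the function name `haversine_np` and its parameters are illustrative, and the Earth radius of 6367 km is an assumption chosen to match the value used in Answer 2 (a mean radius of about 6371 km is also common).

```python
import numpy as np
import pandas as pd

def haversine_np(lat1, lon1, lat2, lon2, radius_km=6367.0):
    """Great-circle distance in km between points given in degrees (vectorized).

    Hypothetical helper for illustration; radius_km is an assumed Earth radius.
    """
    # Convert all inputs from degrees to radians.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Example data from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45], 'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22], 'lon': [-97.87, -96.40, -100.01]})

# Cross join to get all combinations of lat/lon, as in the answer above.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# One vectorized call over whole columns instead of a row-wise apply.
dist['distance'] = haversine_np(dist['lat1'], dist['lon1'], dist['lat2'], dist['lon2'])
```

Because the function operates on whole columns at once, it avoids the per-row Python overhead of `apply`. With the same radius, the values should agree closely with the haversine results in Answer 2 and differ slightly from the vincenty output above, since the haversine formula assumes a spherical Earth.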

3 answers

solution1
3 2017-04-03 18:54:10

solution2
3 2017-04-03 19:05:24

solution3
3 2017-04-03 19:29:31
