Python code to filter closest distance pairs

This is my code. Please note that this is just a toy dataset; my real data has about 1000 entries in each table.

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

/some calc code here/

##df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()]##


df_dist_long.to_csv('dist.csv',float_format='%.2f')

When I add df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()], I get this error:

  File "C:\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 656, in wrapper
    raise ValueError
ValueError

Without it, the output is like so...

    city_A  neigh_B Dist(km)
0   City1   Neigh1  6.45
1   City2   Neigh1  6.42
2   City3   Neigh1  7.93
3   City4   Neigh1  5.56
4   City1   Neigh2  8.25
5   City2   Neigh2  6.67
6   City3   Neigh2  8.55
7   City4   Neigh2  8.92
8   City1   Neigh3  7.01   ..... and so on

What I want is another table that keeps only the city closest to each neighbourhood. For example, for 'Neigh1', City4 is the closest (smallest distance). So I want a table like the one below:

city_A  neigh_B Dist(km)
0   City4   Neigh1  5.56
1   City3   Neigh2  4.32
2   City1   Neigh3  7.93
3   City2   Neigh4  3.21
4   City4   Neigh5  4.56
5   City5   Neigh6  6.67
6   City3   Neigh7  6.16
 ..... and so on

It doesn't matter if a city name is repeated; I just want the closest pairs saved to another CSV. How can this be implemented? Experts, please help!

You don't want to calculate the full distance matrix if you just want the closest city for each neighbourhood.

Here is a working code example, though my output differs from yours, maybe because of a lat/long mistake.

I used your data:

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

Create a BallTree that we can query:

from sklearn.neighbors import BallTree
import numpy as np

stores_gps = locations_stores[['latitude_A', 'longitude_A']].values
neigh_gps = locations_neigh[['latitude_B', 'longitude_B']].values

tree = BallTree(stores_gps, leaf_size=15, metric='haversine')

For each Neigh we want the closest (k=1) City/Store:

distance, index = tree.query(neigh_gps, k=1)
 
earth_radius = 6371

distance_in_km = distance * earth_radius

We can create a DataFrame of the result with

pd.DataFrame({
    'Neighborhood' : locations_neigh.neigh_B,
    'Closest_city' : locations_stores.city_A[ np.array(index)[:,0] ].values,
    'Distance_to_city' : distance_in_km[:,0]
})

This gave me

  Neighborhood Closest_city  Distance_to_city
0       Neigh1        City2      19112.334106
1       Neigh2        City2      19014.154744
2       Neigh3        City2      18851.168702
3       Neigh4        City2      19129.555188
4       Neigh5        City4      15498.181486

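Since our outputs differ, there is some mistake to correct. Maybe lat/long are swapped; I am just guessing here. Another likely cause is that the haversine metric expects coordinates in radians, while the BallTree above was built and queried with values in degrees; wrapping both arrays in np.radians should bring the distances into a realistic range. Either way, this is the approach you want, especially for the size of your data.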
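A minimal sketch of the same BallTree query with the coordinates converted to radians, which is what scikit-learn's haversine metric expects (as [latitude, longitude]). It assumes the unit mismatch explains the very large distances above. With the toy data the stores lie about three degrees of latitude north of the neighbourhoods, so distances of a few hundred km would be expected. The names stores_rad, neigh_rad, closest and the column Distance_to_city_km are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

locations_stores = pd.DataFrame({
    'city_A':      ['City1', 'City2', 'City3', 'City4'],
    'latitude_A':  [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B':     ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B':  [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

# The haversine metric works on [latitude, longitude] pairs in radians.
stores_rad = np.radians(locations_stores[['latitude_A', 'longitude_A']].to_numpy())
neigh_rad = np.radians(locations_neigh[['latitude_B', 'longitude_B']].to_numpy())

tree = BallTree(stores_rad, leaf_size=15, metric='haversine')

# distance and index both have shape (n_neighbourhoods, 1) for k=1.
distance, index = tree.query(neigh_rad, k=1)

earth_radius_km = 6371  # mean Earth radius; turns radians of arc into km

closest = pd.DataFrame({
    'Neighborhood': locations_neigh['neigh_B'],
    'Closest_city': locations_stores['city_A'].to_numpy()[index[:, 0]],
    'Distance_to_city_km': distance[:, 0] * earth_radius_km,
})
print(closest)
```

To save the result as in the question, closest.to_csv('closest.csv', float_format='%.2f') should work the same way as the original to_csv call.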


Edit: For the Full matrix, use

from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

earth_radius = 6371

haversine_distances = dist.pairwise(np.radians(stores_gps), np.radians(neigh_gps) )
haversine_distances *= earth_radius

This will give the full matrix, but be aware that for larger inputs it will take a long time, and expect to hit memory limitations.

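You could use numpy's np.argmin(haversine_distances, axis=0) to get results similar to the BallTree's. Because the matrix above has stores as rows and neighbourhoods as columns, axis=0 gives, for each neighbourhood, the index of the closest store (axis=1 would instead give the closest neighbourhood for each store). That index can be used just like in the BallTree example.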
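The full-matrix route can be sketched as follows. To keep the example self-contained it uses a hand-written NumPy haversine helper (haversine_matrix_km, a hypothetical name) rather than scikit-learn's DistanceMetric. The matrix has one row per store and one column per neighbourhood, so the minimum is taken along axis=0.

```python
import numpy as np
import pandas as pd

def haversine_matrix_km(a_deg, b_deg, radius_km=6371.0):
    """Pairwise great-circle distances in km.

    a_deg and b_deg are arrays of [latitude, longitude] in degrees.
    Rows of the result follow a_deg, columns follow b_deg.
    """
    a = np.radians(np.asarray(a_deg, dtype=float))[:, None, :]  # (n_a, 1, 2)
    b = np.radians(np.asarray(b_deg, dtype=float))[None, :, :]  # (1, n_b, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))

locations_stores = pd.DataFrame({
    'city_A':      ['City1', 'City2', 'City3', 'City4'],
    'latitude_A':  [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B':     ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B':  [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

stores_gps = locations_stores[['latitude_A', 'longitude_A']].to_numpy()
neigh_gps = locations_neigh[['latitude_B', 'longitude_B']].to_numpy()

# Shape (n_stores, n_neighbourhoods): stores are rows, neighbourhoods columns.
dist_km = haversine_matrix_km(stores_gps, neigh_gps)

# For each neighbourhood (column), the row index of the closest store.
closest_store = np.argmin(dist_km, axis=0)

result = pd.DataFrame({
    'neigh_B': locations_neigh['neigh_B'],
    'city_A': locations_stores['city_A'].to_numpy()[closest_store],
    'Dist(km)': dist_km[closest_store, np.arange(dist_km.shape[1])],
})
print(result)
```

If the long table from the question already exists, the same selection can probably be made without the matrix: assuming df_dist_long has a unique index and a numeric 'Dist(km)' column, df_dist_long.loc[df_dist_long.groupby('neigh_B')['Dist(km)'].idxmin()] picks the row with the smallest distance for each neighbourhood.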
