Python code to filter closest distance pairs

This is my code. Please note that this is just a toy dataset; my real set contains about 1000 entries in each table.

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

/some calc code here/

##df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()]##


df_dist_long.to_csv('dist.csv',float_format='%.2f')

When I add df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()], I get this error:

  File "C:\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 656, in wrapper
    raise ValueError
ValueError

Without it, the output looks like this:

    city_A  neigh_B Dist(km)
0   City1   Neigh1  6.45
1   City2   Neigh1  6.42
2   City3   Neigh1  7.93
3   City4   Neigh1  5.56
4   City1   Neigh2  8.25
5   City2   Neigh2  6.67
6   City3   Neigh2  8.55
7   City4   Neigh2  8.92
8   City1   Neigh3  7.01   ..... and so on

What I want is another table that keeps only the city closest to each neighbourhood. As an example, for 'Neigh1', City4 is the closest (smallest distance). So I want a table like the one below:

city_A  neigh_B Dist(km)
0   City4   Neigh1  5.56
1   City3   Neigh2  4.32
2   City1   Neigh3  7.93
3   City2   Neigh4  3.21
4   City4   Neigh5  4.56
5   City5   Neigh6  6.67
6   City3   Neigh7  6.16
 ..... and so on

It doesn't matter if a city name is repeated; I just want the closest pair for each neighbourhood saved to another CSV. How can this be implemented? Please help!

You don't want to calculate the full distance matrix if you just want the closest city for each neighbourhood.

Here is a working code example, though I get different output than yours. Maybe a lat/long mistake.

I used your data:

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

Create a BallTree that we can query:

from sklearn.neighbors import BallTree
import numpy as np

stores_gps = locations_stores[['latitude_A', 'longitude_A']].values
neigh_gps = locations_neigh[['latitude_B', 'longitude_B']].values

tree = BallTree(stores_gps, leaf_size=15, metric='haversine')

And for each Neigh we want the closest (k=1) City/Store:

distance, index = tree.query(neigh_gps, k=1)
 
earth_radius = 6371

distance_in_km = distance * earth_radius

We can create a DataFrame of the result with

pd.DataFrame({
    'Neighborhood' : locations_neigh.neigh_B,
    'Closest_city' : locations_stores.city_A[ np.array(index)[:,0] ].values,
    'Distance_to_city' : distance_in_km[:,0]
})

This gave me

  Neighborhood Closest_city  Distance_to_city
0       Neigh1        City2      19112.334106
1       Neigh2        City2      19014.154744
2       Neigh3        City2      18851.168702
3       Neigh4        City2      19129.555188
4       Neigh5        City4      15498.181486

Since our outputs differ, there is some mistake to correct. Maybe the lat/long are swapped; I am just guessing here. But this is the approach you want, especially for the amount of data you have.
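One possible source of the very large distances: scikit-learn's 'haversine' metric expects [latitude, longitude] in radians, while the BallTree snippet above passes degrees. A minimal sketch of the same query with the conversion added is below. The names stores_rad, neigh_rad, EARTH_RADIUS_KM and closest are only illustrative, and this will not reproduce the distances in the question either, since those come from the asker's own calculation code, which is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

locations_stores = pd.DataFrame({
    'city_A':      ['City1', 'City2', 'City3', 'City4'],
    'latitude_A':  [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B':     ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B':  [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

# The haversine metric works on [lat, lon] pairs given in radians.
stores_rad = np.radians(locations_stores[['latitude_A', 'longitude_A']].to_numpy())
neigh_rad = np.radians(locations_neigh[['latitude_B', 'longitude_B']].to_numpy())

tree = BallTree(stores_rad, leaf_size=15, metric='haversine')

# For every neighbourhood, find the single nearest store (k=1).
# The returned distances are angles in radians on the unit sphere.
distance, index = tree.query(neigh_rad, k=1)

EARTH_RADIUS_KM = 6371  # mean Earth radius, converts radians to km
closest = pd.DataFrame({
    'Neighborhood':     locations_neigh['neigh_B'].to_numpy(),
    'Closest_city':     locations_stores['city_A'].to_numpy()[index[:, 0]],
    'Distance_to_city': distance[:, 0] * EARTH_RADIUS_KM,
})
print(closest)
```

With the toy coordinates, the stores sit roughly three degrees of latitude north of the neighbourhoods, so distances of a few hundred kilometres are expected rather than thousands.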


Edit: For the full matrix, use

from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

earth_radius = 6371

haversine_distances = dist.pairwise(np.radians(stores_gps), np.radians(neigh_gps) )
haversine_distances *= earth_radius

This will give the full matrix, but be aware that for larger inputs it will take a long time, and you can expect to hit memory limits.

You could use numpy's np.argmin(haversine_distances, axis=0) to get results similar to the BallTree. Because the rows of this matrix are stores and the columns are neighbourhoods, taking the argmin along axis 0 yields, for each neighbourhood, the index of the closest store, which can be used just like the index in the BallTree example.
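If you already have the long table from the question (one row per city/neighbourhood pair), the closest pair per neighbourhood can also be selected directly in pandas. The sketch below is only an illustration under assumptions: the haversine_km helper and the way df_dist_long is built are stand-ins for the question's omitted calculation code. The key step is groupby(...).idxmin(), which returns row labels that .loc can use, whereas .min() on 'city_A' returns city names.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are broadcastable arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

locations_stores = pd.DataFrame({
    'city_A':      ['City1', 'City2', 'City3', 'City4'],
    'latitude_A':  [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B':     ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B':  [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

# Full matrix via broadcasting: rows = stores, columns = neighbourhoods.
matrix = haversine_km(
    locations_stores['latitude_A'].to_numpy()[:, None],
    locations_stores['longitude_A'].to_numpy()[:, None],
    locations_neigh['latitude_B'].to_numpy()[None, :],
    locations_neigh['longitude_B'].to_numpy()[None, :],
)

# Flatten to the long format used in the question (assumed layout).
n_stores, n_neigh = matrix.shape
df_dist_long = pd.DataFrame({
    'city_A':   np.repeat(locations_stores['city_A'].to_numpy(), n_neigh),
    'neigh_B':  np.tile(locations_neigh['neigh_B'].to_numpy(), n_stores),
    'Dist(km)': matrix.ravel(),
})

# idxmin gives the row label of the smallest distance within each group.
closest_rows = df_dist_long.groupby('neigh_B')['Dist(km)'].idxmin()
closest = df_dist_long.loc[closest_rows].reset_index(drop=True)
print(closest)
```

The result can then be written out the same way as in the question, for example with closest.to_csv('closest.csv', float_format='%.2f'), where the file name is just a placeholder.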
