Python code to filter closest distance pair

Question

This is my code. Please note that this is just a toy dataset; my real set contains about 1000 entries in each table.

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

/some calc code here/

##df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()]##


df_dist_long.to_csv('dist.csv',float_format='%.2f')

When I add df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()], I get this error:

 File "C:\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 656, in wrapper                                                    
    raise ValueError                                                                                                                                  
ValueError

Without it, the output looks like this:

    city_A  neigh_B Dist(km)
0   City1   Neigh1  6.45
1   City2   Neigh1  6.42
2   City3   Neigh1  7.93
3   City4   Neigh1  5.56
4   City1   Neigh2  8.25
5   City2   Neigh2  6.67
6   City3   Neigh2  8.55
7   City4   Neigh2  8.92
8   City1   Neigh3  7.01   ..... and so on

What I want is another table that keeps only the city closest to each neighbourhood. For example, for 'Neigh1', City4 is the closest (smallest distance). So I want a table like the one below:

city_A  neigh_B Dist(km)
0   City4   Neigh1  5.56
1   City3   Neigh2  4.32
2   City1   Neigh3  7.93
3   City2   Neigh4  3.21
4   City4   Neigh5  4.56
5   City5   Neigh6  6.67
6   City3   Neigh7  6.16
 ..... and so on

It doesn't matter if a city name is repeated; I just want the closest pair saved to another CSV. How can this be implemented? Experts, please help!

Answer 1

You don't want to calculate the full distance matrix if you just want the closest city for each neighbourhood.

Here is a working code example, though I get different output than yours. Maybe a lat/long mistake.

I used your data:

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

I created a BallTree we can query:

from sklearn.neighbors import BallTree
import numpy as np

stores_gps = locations_stores[['latitude_A', 'longitude_A']].values
neigh_gps = locations_neigh[['latitude_B', 'longitude_B']].values

tree = BallTree(stores_gps, leaf_size=15, metric='haversine')

And for each Neigh we want the closest (k=1) City/Store:

distance, index = tree.query(neigh_gps, k=1)
 
earth_radius = 6371

distance_in_km = distance * earth_radius

We can create a DataFrame of the result with:

pd.DataFrame({
    'Neighborhood' : locations_neigh.neigh_B,
    'Closest_city' : locations_stores.city_A[ np.array(index)[:,0] ].values,
    'Distance_to_city' : distance_in_km[:,0]
})

This gave me:

  Neighborhood Closest_city  Distance_to_city
0       Neigh1        City2      19112.334106
1       Neigh2        City2      19014.154744
2       Neigh3        City2      18851.168702
3       Neigh4        City2      19129.555188
4       Neigh5        City4      15498.181486

Since our output is different, there is some mistake to correct. Maybe lat/long are swapped; I am just guessing here. But this is the approach you want, especially for the amount of data you have.
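One likely source of the implausible distances of roughly 19,000 km above: scikit-learn's haversine metric expects each point as [latitude, longitude] in radians, while the snippet passes degrees. Below is a minimal sketch of the same BallTree query with np.radians applied first. The names stores_rad, neigh_rad, earth_radius_km and closest are illustrative and not from the original post; everything else follows the answer's code.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

locations_stores = pd.DataFrame({
    'city_A': ['City1', 'City2', 'City3', 'City4'],
    'latitude_A': [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B': ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B': [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

# The haversine metric works on [latitude, longitude] pairs in radians,
# so convert the degree values before building and querying the tree.
stores_rad = np.radians(locations_stores[['latitude_A', 'longitude_A']].to_numpy())
neigh_rad = np.radians(locations_neigh[['latitude_B', 'longitude_B']].to_numpy())

tree = BallTree(stores_rad, leaf_size=15, metric='haversine')

# For every neighbourhood, find the single nearest store (k=1).
# `distance` is returned in radians on the unit sphere.
distance, index = tree.query(neigh_rad, k=1)

earth_radius_km = 6371  # same Earth radius as used in the answer
closest = pd.DataFrame({
    'Neighborhood': locations_neigh['neigh_B'],
    'Closest_city': locations_stores['city_A'].to_numpy()[index[:, 0]],
    'Distance_to_city_km': distance[:, 0] * earth_radius_km,
})
print(closest)
```

With the toy coordinates, the stores sit about three degrees of latitude north of the neighbourhoods, so distances of a few hundred kilometres are expected; they will still not match the single-digit values in the question, which suggests the toy coordinates differ from the asker's real data. The result can be saved with closest.to_csv('closest.csv', index=False, float_format='%.2f').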

Edit: For the full matrix, use:

from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

earth_radius = 6371

haversine_distances = dist.pairwise(np.radians(stores_gps), np.radians(neigh_gps) )
haversine_distances *= earth_radius

This will give the full matrix, but be aware that for larger inputs it will take a long time, and you can expect to hit memory limits.

You could use NumPy's np.argmin(haversine_distances, axis=0) to get results similar to the BallTree's. Because the rows of the matrix are stores and the columns are neighbourhoods, taking the argmin down each column returns, for each neighbourhood, the index of the closest store, which can be used just like the index in the BallTree example.
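To tie this back to the question's long-format table, here is a minimal sketch that selects the closest city per neighbourhood in two ways: np.argmin on the full matrix, and groupby(...).idxmin() on a long table shaped like df_dist_long. The expression in the question takes the alphabetical minimum of the city names instead of the row with the smallest distance, so idxmin on 'Dist(km)' is used here. The helper haversine_km is a hypothetical NumPy-only stand-in for the DistanceMetric call, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

cities = ['City1', 'City2', 'City3', 'City4']
stores_gps = np.array([[56.361176, 4.899779], [56.34061, 4.871195],
                       [56.374749, 4.893847], [56.356624, 4.912281]])
neighs = ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5']
neigh_gps = np.array([[53.314, 4.955], [53.318, 4.975], [53.381, 4.855],
                      [53.338, 4.873], [53.7364, 4.425]])


def haversine_km(a_deg, b_deg, radius_km=6371):
    """Hypothetical helper: pairwise great-circle distances in km.

    Inputs are arrays of [latitude, longitude] in degrees; the result
    has shape (len(a_deg), len(b_deg)).
    """
    a = np.radians(a_deg)[:, None, :]  # shape (n_a, 1, 2)
    b = np.radians(b_deg)[None, :, :]  # shape (1, n_b, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))


# Rows are cities, columns are neighbourhoods.
dist_km = haversine_km(stores_gps, neigh_gps)

# Option 1: argmin down each column gives the closest city per neighbourhood.
closest_idx = np.argmin(dist_km, axis=0)
closest_np = pd.DataFrame({
    'city_A': np.array(cities)[closest_idx],
    'neigh_B': neighs,
    'Dist(km)': dist_km[closest_idx, np.arange(len(neighs))],
})

# Option 2: build a long table like the question's df_dist_long,
# then keep the row with the smallest distance in each neigh_B group.
df_dist_long = (
    pd.DataFrame(dist_km, index=cities, columns=neighs)
    .rename_axis('city_A')
    .reset_index()
    .melt(id_vars='city_A', var_name='neigh_B', value_name='Dist(km)')
)
closest_pd = (
    df_dist_long.loc[df_dist_long.groupby('neigh_B')['Dist(km)'].idxmin()]
    .reset_index(drop=True)
)
print(closest_pd)
```

Both options return one row per neighbourhood. The idxmin variant can be applied directly to an existing df_dist_long, and the result can be written out with closest_pd.to_csv('closest.csv', index=False, float_format='%.2f').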

Solution 1
1 Accepted 2020-09-11 09:47:25
