Python code to filter the closest-distance pairs

Question

Here is my code. Note that this is only a toy dataset; my real dataset has about 1000 entries in each table.

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

/some calc code here/

##df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()]##


df_dist_long.to_csv('dist.csv',float_format='%.2f')

When I add df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()], I get this error:

 File "C:\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 656, in wrapper                                                    
    raise ValueError                                                                                                                                  
ValueError

Without it, the output looks like this:

    city_A  neigh_B Dist(km)
0   City1   Neigh1  6.45
1   City2   Neigh1  6.42
2   City3   Neigh1  7.93
3   City4   Neigh1  5.56
4   City1   Neigh2  8.25
5   City2   Neigh2  6.67
6   City3   Neigh2  8.55
7   City4   Neigh2  8.92
8   City1   Neigh3  7.01   ..... and so on

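What I want is another table that keeps only the city closest to each neighborhood. For example, for "Neigh1", City4 is the closest (shortest distance). So I want the following table: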

    city_A  neigh_B Dist(km)
0   City4   Neigh1  5.56
1   City3   Neigh2  4.32
2   City1   Neigh3  7.93
3   City2   Neigh4  3.21
4   City4   Neigh5  4.56
5   City5   Neigh6  6.67
6   City3   Neigh7  6.16
 ..... and so on

It does not matter if city names repeat; I just want to save the closest pairs to another CSV. How can I do this? Any help would be appreciated!

Answer 1

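If you only want the closest city for each neighborhood, you don't need to compute the full distance matrix.

Here is a working code example, although the output I get differs from yours. Perhaps the latitudes/longitudes are wrong.

I used your data: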

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

Then I created a BallTree that we can query:

from sklearn.neighbors import BallTree
import numpy as np

stores_gps = locations_stores[['latitude_A', 'longitude_A']].values
neigh_gps = locations_neigh[['latitude_B', 'longitude_B']].values

tree = BallTree(stores_gps, leaf_size=15, metric='haversine')

For each Neigh we want the closest (k=1) city/store:

distance, index = tree.query(neigh_gps, k=1)
 
earth_radius = 6371

distance_in_km = distance * earth_radius

We can create a DataFrame of the results:

pd.DataFrame({
    'Neighborhood' : locations_neigh.neigh_B,
    'Closest_city' : locations_stores.city_A[ np.array(index)[:,0] ].values,
    'Distance_to_city' : distance_in_km[:,0]
})

This gives me:

  Neighborhood Closest_city  Distance_to_city
0       Neigh1        City2      19112.334106
1       Neigh2        City2      19014.154744
2       Neigh3        City2      18851.168702
3       Neigh4        City2      19129.555188
4       Neigh5        City4      15498.181486

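Since our outputs differ, something still needs correcting. My first guess was swapped latitude/longitude, but the more likely cause is units: scikit-learn's haversine metric expects [latitude, longitude] in radians, and the code above passes degrees, so the distances of 15,000 to 19,000 km are not meaningful. Wrapping both arrays in np.radians(...) before building and querying the tree should fix that. Either way, this is the approach you want, especially for your data volume.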
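scikit-learn documents the haversine metric as taking [latitude, longitude] in radians, so passing degrees is a likely reason for the very large distances. Below is a minimal sketch of the same BallTree query with np.radians applied to both arrays. The names stores_rad, neigh_rad, closest and EARTH_RADIUS_KM are my own, and the output column names simply mirror the table in the question.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371  # same mean Earth radius as used in the answer

locations_stores = pd.DataFrame({
    'city_A': ['City1', 'City2', 'City3', 'City4'],
    'latitude_A': [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B': ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B': [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

# The haversine metric works on [latitude, longitude] pairs in radians.
stores_rad = np.radians(locations_stores[['latitude_A', 'longitude_A']].to_numpy())
neigh_rad = np.radians(locations_neigh[['latitude_B', 'longitude_B']].to_numpy())

tree = BallTree(stores_rad, leaf_size=15, metric='haversine')

# For every neighborhood, find the single nearest store (k=1).
# The returned distances are angles in radians on the unit sphere.
distance, index = tree.query(neigh_rad, k=1)

closest = pd.DataFrame({
    'neigh_B': locations_neigh['neigh_B'],
    'city_A': locations_stores['city_A'].to_numpy()[index[:, 0]],
    'Dist(km)': distance[:, 0] * EARTH_RADIUS_KM,
})
print(closest)
```

With the toy coordinates the stores sit about three degrees of latitude north of the neighborhoods, so distances of a few hundred kilometres are expected here rather than the single-digit values in the question; the toy data and the question's printed table apparently do not correspond.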

Edit: for the full matrix, use

from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

earth_radius = 6371

haversine_distances = dist.pairwise(np.radians(stores_gps), np.radians(neigh_gps) )
haversine_distances *= earth_radius

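This gives the full matrix, but note that for larger inputs it will take a long time and can run into memory limits. At about 1000 entries per table the matrix has roughly one million float64 values (about 8 MB), which is still manageable.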
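To make the shape and memory cost of the full matrix concrete, here is a hand-written NumPy version of the haversine formula. It is a sketch that stands in for DistanceMetric, not the library's implementation; the function name haversine_matrix_km and the constant name are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # same mean Earth radius as used in the answer


def haversine_matrix_km(points_a_deg, points_b_deg):
    """Pairwise great-circle distances in km.

    Both inputs are arrays of [latitude, longitude] in degrees.
    The result has one row per point in A and one column per point in B.
    """
    a = np.radians(np.asarray(points_a_deg, dtype=float))[:, None, :]  # (n_a, 1, 2)
    b = np.radians(np.asarray(points_b_deg, dtype=float))[None, :, :]  # (1, n_b, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


stores_gps = np.array([
    [56.361176, 4.899779],
    [56.34061, 4.871195],
    [56.374749, 4.893847],
    [56.356624, 4.912281],
])
neigh_gps = np.array([
    [53.314, 4.955],
    [53.318, 4.975],
    [53.381, 4.855],
    [53.338, 4.873],
    [53.7364, 4.425],
])

dist_km = haversine_matrix_km(stores_gps, neigh_gps)
print(dist_km.shape)   # rows are stores, columns are neighborhoods
print(dist_km.nbytes)  # n_stores * n_neigh * 8 bytes for float64
```

Memory grows with the product of the two table sizes, which is why the tree-based query scales better once the tables get large.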

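You can get results similar to the BallTree with NumPy's np.argmin(haversine_distances, axis=0). The matrix has one row per store and one column per neighborhood, so axis=0 yields, for each neighborhood, the index of the nearest store (axis=1 would instead give the nearest neighborhood for each store). That index can be used just like in the BallTree example.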
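As a small illustration of the argmin step, the sketch below uses the eight distances printed in the question (Neigh1 and Neigh2 only) instead of recomputing anything. It also shows what I assume is the intended fix for the failing line in the question: that line takes the minimum of city_A, which is the alphabetically smallest name rather than the nearest city, whereas idxmin on the distance column returns the row label of the shortest distance per neighborhood. The variable names other than df_dist_long are my own.

```python
import numpy as np
import pandas as pd

cities = np.array(['City1', 'City2', 'City3', 'City4'])
neighs = np.array(['Neigh1', 'Neigh2'])

# Rows are cities, columns are neighborhoods (values from the question's output).
dist_km = np.array([
    [6.45, 8.25],
    [6.42, 6.67],
    [7.93, 8.55],
    [5.56, 8.92],
])

# axis=0 scans down each column: nearest city index per neighborhood.
nearest = dist_km.argmin(axis=0)
from_matrix = pd.DataFrame({
    'city_A': cities[nearest],
    'neigh_B': neighs,
    'Dist(km)': dist_km[nearest, np.arange(len(neighs))],
})

# The same filter on a long table shaped like df_dist_long in the question.
df_dist_long = pd.DataFrame({
    'city_A': np.tile(cities, len(neighs)),
    'neigh_B': np.repeat(neighs, len(cities)),
    'Dist(km)': dist_km.T.ravel(),
})
closest_rows = df_dist_long.groupby('neigh_B')['Dist(km)'].idxmin()
from_long = df_dist_long.loc[closest_rows].reset_index(drop=True)

print(from_matrix)
print(from_long)
```

Either result can then be written out the same way as in the question, for example with from_long.to_csv('closest.csv', float_format='%.2f', index=False), where the file name is only a placeholder.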
