簡體   English   中英

計算點之間最短(歐幾里得)距離的最快方法,在 pandas dataframe

[英]Fastest way to calculate the shortest (euclidean) distance between points, in pandas dataframe

考慮以下 pandas dataframe:

print(df)

     Id      X      Y Type  X of Closest  Y of Closest
0   201  73.91  34.84    A           NaN           NaN
1   201  74.67  32.64    A           NaN           NaN
2   201  74.00  33.20    A           NaN           NaN
3   201  71.46  27.70    A           NaN           NaN
4   201  69.32  35.42    A           NaN           NaN
5   201  75.06  24.00    B           NaN           NaN
6   201  74.11  16.64    B           NaN           NaN
7   201  73.37  18.73    B           NaN           NaN
8   201  56.63  26.90    B           NaN           NaN
9   201  73.35  38.83    B           NaN           NaN
10  512  74.15  28.90    A           NaN           NaN
11  512  75.82  17.56    A           NaN           NaN
12  512  74.78  33.21    A           NaN           NaN
13  512  75.43  32.41    A           NaN           NaN
14  512  75.90  25.12    A           NaN           NaN
15  512  79.76  29.49    B           NaN           NaN
16  512  76.47  36.91    B           NaN           NaN
17  512  74.70  19.19    B           NaN           NaN
18  512  78.75  30.53    B           NaN           NaN
19  512  74.60  31.88    B           NaN           NaN

請注意,對於每個 Id,總是有 10 行,A 類型有 5 行,B 類型有 5 行。

我想創建 2 列,“最近的 X”和“最近的 Y”。 我的意思是,X,Y 對(每個 Id 的類型相反)是最短的歐幾里德距離。

第一行示例:與 (73.91, 34.84) 最接近的(B 型)對是 (73.35,38.83) 對 - 歐幾里得距離為 4.03。

一種(可能?)方法是構造 10 列 - 每個 Id 中的點之間的歐幾里德距離。 然后 select 與對面類型的最小歐幾里得距離,我相信會有更快的方法。 盡管。

對於快速(編碼)解決方案,我們可以在 groupby 上使用apply

from scipy.spatial import distance_matrix

def get_min_dist(x):
    # compute distance matrix
    tmp = distance_matrix(x.iloc[:5], x.iloc[5:])

    # get index min of corresponding types
    idx = np.concatenate((np.argmin(tmp,1)+5),  # type A to type B
                          np.argmin(tmp, 0)     # type B to type A
                        )

    return pd.DataFrame(x.iloc[idx].values, 
                        index=x.index, 
                        columns=[a+'_closest' for a in x.columns])

df.groupby('Id')[['X','Y']].apply(get_min_dist)

Output:

    X_closest  Y_closest
0       73.35      38.83
1       73.35      38.83
2       73.35      38.83
3       75.06      24.00
4       73.35      38.83
5       71.46      27.70
6       71.46      27.70
7       71.46      27.70
8       71.46      27.70
9       73.91      34.84
10      74.60      31.88
11      74.70      19.19
12      74.60      31.88
13      74.60      31.88
14      79.76      29.49
15      75.43      32.41
16      74.78      33.21
17      75.82      17.56
18      75.43      32.41
19      75.43      32.41

這是我使用 Numpy 廣播的解決方案

df = pd.DataFrame([[201, 73.91, 34.84, 'A', np.nan, np.nan], [201, 74.67, 32.64, 'A', np.nan, np.nan], [201, 74.0, 33.2, 'A', np.nan, np.nan], [201, 71.46, 27.7, 'A', np.nan, np.nan], [201, 69.32, 35.42, 'A', np.nan, np.nan], [201, 75.06, 24.0, 'B', np.nan, np.nan], [201, 74.11, 16.64, 'B', np.nan, np.nan], [201, 73.37, 18.73, 'B', np.nan, np.nan], [201, 56.63, 26.9, 'B', np.nan, np.nan], [201, 73.35, 38.83, 'B', np.nan, np.nan], [512, 74.15, 28.9, 'A', np.nan, np.nan], [512, 75.82, 17.56, 'A', np.nan, np.nan], [512, 74.78, 33.21, 'A', np.nan, np.nan], [512, 75.43, 32.41, 'A', np.nan, np.nan], [512, 75.9, 25.12, 'A', np.nan, np.nan], [512, 79.76, 29.49, 'B', np.nan, np.nan], [512, 76.47, 36.91, 'B', np.nan, np.nan], [512, 74.7, 19.19, 'B', np.nan, np.nan], [512, 78.75, 30.53, 'B', np.nan, np.nan], [512, 74.6, 31.88, 'B', np.nan, np.nan]], columns=('Id', 'X', 'Y', 'Type', 'X-of-Closest', 'Y-of-Closest'))

## assuming that df is sorted by ID and Type we can create this 4 dimensional array where
## dim0->no of unique ids, dim1-> 2 (type A, B), dim2->5 values of each type, dim3->X or Y
values = df[['X','Y']].values.reshape(-1,2, 5, 2).copy()

## values[:,0,:,:] will take rows of type A for all ids
## and the broadcast repeates values of type A and B 5 times each
## which represents 5X5=25 possible pairs of points of type A and B
diff = values[:,0,:,:][:,:,np.newaxis,:] - values[:,1,:,:][:,np.newaxis,:,:]

## get index of min distance for type A and B 
ind1 = np.argmin(np.sum(diff**2, axis=-1), axis=-1)
ind2 = np.argmin(np.sum(diff**2, axis=-1), axis=-2)

## use the index to set point with min distance to other type
closest_points = np.empty_like(values)
closest_points[:,0] = values[0,1,ind1]
closest_points[:,1] = values[0,0,ind2]

## assign result back to df
df[["X-of-Closest","Y-of-Closest"]] = closest_points.reshape(-1,2)
print(df)

結果

     Id      X      Y Type  X-of-Closest  Y-of-Closest
0   201  73.91  34.84    A         73.35         38.83
1   201  74.67  32.64    A         73.35         38.83
2   201  74.00  33.20    A         73.35         38.83
3   201  71.46  27.70    A         75.06         24.00
4   201  69.32  35.42    A         73.35         38.83
5   201  75.06  24.00    B         71.46         27.70
6   201  74.11  16.64    B         71.46         27.70
7   201  73.37  18.73    B         71.46         27.70
8   201  56.63  26.90    B         71.46         27.70
9   201  73.35  38.83    B         73.91         34.84
10  512  74.15  28.90    A         73.35         38.83
11  512  75.82  17.56    A         73.37         18.73
12  512  74.78  33.21    A         73.35         38.83
13  512  75.43  32.41    A         73.35         38.83
14  512  75.90  25.12    A         75.06         24.00
15  512  79.76  29.49    B         71.46         27.70
16  512  76.47  36.91    B         74.00         33.20
17  512  74.70  19.19    B         74.67         32.64
18  512  78.75  30.53    B         71.46         27.70
19  512  74.60  31.88    B         71.46         27.70

有關廣播如何工作的詳細信息,請查看此博客的廣播部分

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM