Speeding up nested loops on a Pandas DataFrame

Question

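I have a pandas.DataFrame containing the coordinates of many objects, arranged as identity, x and y.

I am trying to find the closest object across the two identities. To clarify what I mean, take the following code: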

import numpy as np
import pandas as pd

# Generate random data
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(10000), 'y':np.random.randn(10000)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(10000), 'y':np.random.randn(10000)})
df = pd.concat([df_identity_1, df_identity_2])

>>> df
      identity         x         y
0            1 -1.784748  2.085517
1            1  0.324645 -1.584790
2            1 -0.044623 -0.348576
3            1  0.802035  1.362336
4            1 -0.091508 -0.655114
...        ...       ...       ...
9995         2  0.939491  0.304964
9996         2 -0.233707 -0.135265
9997         2  0.792494  1.157236
9998         2 -0.385080 -0.021226
9999         2  0.105970 -0.042135

Currently, I have to iterate over every row and then iterate over the whole DataFrame again to find the closest coordinate.

# Function to find the absolute / Euclidean distance between two coordinates
def euclidean(x1, y1, x2, y2):
    a = np.array((int(x1), int(y1)))
    b = np.array((int(x2), int(y2)))
    return np.linalg.norm(a-b)

# Function to find the closest coordinate with a different index
def find_closest_coord(row, df):
    d = df[(df['identity'] != int(row['identity']))]
    if d.empty:
        return None
    return min(euclidean(row.x, row.y, r.x, r.y) for r in df.itertuples(index=False))

df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)

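This code is fully functional, but when I have a large dataset (100k+ coordinates), this "nested" for loop is very time-consuming.

Is there some functionality that could speed up this concept, or an altogether faster approach?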

Answer 1

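The best way to solve this problem is to use a spatial data structure. These data structures let you significantly reduce the size of the search space when you need to run this kind of query. SciPy provides a KD-tree for nearest-neighbor queries, but scaling it out to multiple machines would be a bit of a hassle (if your data size requires that).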
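As a rough illustration of the KD-tree idea, here is a minimal single-machine sketch using scipy.spatial.cKDTree. The helper name nearest_distance, the fixed seed, and the sample size are assumptions made for this example and are not part of the original question.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Illustrative data in the same shape as the question (size and seed are arbitrary)
rng = np.random.default_rng(0)
n = 1000
df1 = pd.DataFrame({'identity': 1, 'x': rng.standard_normal(n), 'y': rng.standard_normal(n)})
df2 = pd.DataFrame({'identity': 2, 'x': rng.standard_normal(n), 'y': rng.standard_normal(n)})

def nearest_distance(source, target):
    """Hypothetical helper: distance from each source point to its nearest target point."""
    # Build the tree once over the target coordinates
    tree = cKDTree(target[['x', 'y']].to_numpy())
    # Query all source points in one call; k=1 returns the single nearest neighbor
    dist, _ = tree.query(source[['x', 'y']].to_numpy(), k=1)
    return dist

df1['closest_coord'] = nearest_distance(df1, df2)
df2['closest_coord'] = nearest_distance(df2, df1)
df = pd.concat([df1, df2])
```

The tree is built once per identity and queried for all points of the other identity at once, so no Python-level loop over pairs of rows is needed.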

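If you do need to scale out, you may want to use a dedicated geospatial analytics tool.

In general, if you want to speed something like this up, you need to make a trade-off between an iterative approach and memory intensity.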

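However, in this case, your core bottlenecks are:

Iterating row by row
Calling euclidean once per pair of rows, rather than once per dataset.

NumPy functions (such as norm) are columnar in nature, and you should take advantage of that by calling them on whole arrays of data. If each of your dataframes has 10,000 rows, you are calling norm 100 million times. Just tweaking this slightly should help you a lot.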
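One possible "slight tweak", sketched under my own assumptions: keep the outer loop over rows, but compute the distances from each point to every point of the other identity with a single norm call. The helper name closest_by_row and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data (size and seed are arbitrary)
rng = np.random.default_rng(0)
n = 500
df1 = pd.DataFrame({'identity': 1, 'x': rng.standard_normal(n), 'y': rng.standard_normal(n)})
df2 = pd.DataFrame({'identity': 2, 'x': rng.standard_normal(n), 'y': rng.standard_normal(n)})

def closest_by_row(source, target):
    """Hypothetical helper: one norm call per source row instead of one per pair of rows."""
    target_xy = target[['x', 'y']].to_numpy()
    out = np.empty(len(source))
    for i, point in enumerate(source[['x', 'y']].to_numpy()):
        # Broadcasting subtracts this point from every target point at once
        out[i] = np.linalg.norm(target_xy - point, axis=1).min()
    return out

df1['closest_coord'] = closest_by_row(df1, df2)
df2['closest_coord'] = closest_by_row(df2, df1)
```

With 10,000 rows per identity this makes 20,000 norm calls instead of 100 million, while holding only one row's worth of distances in memory at a time.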

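If you want to do this at scale in Python and cannot use a spatial data structure efficiently (and do not want to use heuristics to reduce the search space), something like the following would probably work: cross-product join the two tables, compute the Euclidean distance once with a single columnar operation, and use a groupby-aggregation (min) to get the closest point.

This is much faster than row-by-row iteration as you are doing it, and much more memory intensive, but it can easily be scaled with something like Dask (or Spark).

I will use only a few rows to illustrate the logic.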

import numpy as np
import pandas as pd

# Generate random data
nrows = 3
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_1.reset_index(drop=False, inplace=True)
df_identity_2.reset_index(drop=False, inplace=True)

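Note how, in addition to the identity flag, I also created a unique index for each dataframe. This will come in handy later for the groupby. Next, I can do the cross-product join. This would be cleaner if we used different column names, but I will keep it consistent with your example. As the dataset grows, this join will quickly run out of memory in pure Pandas, but Dask ( https://dask.org/ ) handles it well.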

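def cross_product(left, right):
    # Join every row of left with every row of right through a constant dummy key.
    # drop(columns=...) replaces the positional axis argument, which newer pandas rejects.
    return left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')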

crossprod = cross_product(df_identity_1, df_identity_2)
crossprod
   index_x  identity_x       x_x       y_x  index_y  identity_y       x_y       y_y
0        0           1  1.660468 -1.954339        0           2 -0.431543  0.500864
1        0           1  1.660468 -1.954339        1           2 -0.607647 -0.436480
2        0           1  1.660468 -1.954339        2           2  1.613126 -0.696860
3        1           1  0.153419  0.619493        0           2 -0.431543  0.500864
4        1           1  0.153419  0.619493        1           2 -0.607647 -0.436480
5        1           1  0.153419  0.619493        2           2  1.613126 -0.696860
6        2           1 -0.592440 -0.299046        0           2 -0.431543  0.500864
7        2           1 -0.592440 -0.299046        1           2 -0.607647 -0.436480
8        2           1 -0.592440 -0.299046        2           2  1.613126 -0.696860
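As a side note, on pandas 1.2 or later the same join can probably be written without the dummy key by passing how='cross' to merge. This is a sketch assuming that pandas version; the sample data simply mirrors the setup above.

```python
import numpy as np
import pandas as pd

# Same small setup as above
nrows = 3
df_identity_1 = pd.DataFrame({'identity': 1, 'x': np.random.randn(nrows), 'y': np.random.randn(nrows)})
df_identity_2 = pd.DataFrame({'identity': 2, 'x': np.random.randn(nrows), 'y': np.random.randn(nrows)})
df_identity_1.reset_index(drop=False, inplace=True)
df_identity_2.reset_index(drop=False, inplace=True)

def cross_product(left, right):
    # Assumes pandas >= 1.2, where merge accepts how='cross'
    return left.merge(right, how='cross')

crossprod = cross_product(df_identity_1, df_identity_2)
```

The overlapping column names still receive the _x and _y suffixes, so the rest of the steps work unchanged.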

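Next, we just need to compute the distance for each row, then group by index_x and index_y (respectively) and take the minimum distance value. Note how we can do this with a single call to norm, rather than one call per row.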

crossprod['dist'] = np.linalg.norm(crossprod[['x_x', 'y_x']].values - crossprod[['x_y', 'y_y']].values, axis=1)
closest_per_identity1 = crossprod.groupby(['index_x']).agg({'dist':'min'})
closest_per_identity2 = crossprod.groupby(['index_y']).agg({'dist':'min'})

closest_per_identity1
             dist
index_x
0        1.258370
1        0.596869
2        0.138273

closest_per_identity2
             dist
index_y
0        0.596869
1        0.138273
2        1.258370
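If you also need to know which point is the closest one, not just how far away it is, one option is to use idxmin on the grouped distances and look those rows up. The sketch below uses a tiny hand-made dataset (the values are made up for illustration) with the same column layout as the cross product.

```python
import numpy as np
import pandas as pd

# Tiny hand-made example; the coordinates are illustrative only
left = pd.DataFrame({'index': [0, 1], 'x': [0.0, 10.0], 'y': [0.0, 0.0]})
right = pd.DataFrame({'index': [0, 1], 'x': [9.0, 1.0], 'y': [0.0, 0.0]})

# Cross product via a constant dummy key, as in the answer
crossprod = left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')
crossprod['dist'] = np.linalg.norm(
    crossprod[['x_x', 'y_x']].values - crossprod[['x_y', 'y_y']].values, axis=1)

# idxmin returns, per index_x group, the row label holding the smallest distance
best_rows = crossprod.loc[crossprod.groupby('index_x')['dist'].idxmin()]
nearest = best_rows[['index_x', 'index_y', 'dist']].reset_index(drop=True)
```

Here nearest holds one row per point of the first identity, with index_y naming its closest counterpart in the second identity.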

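Compare this with your original example on the same data. Note that I changed your int calls to floats, and changed your itertuples to iterate over d rather than df (otherwise you would be comparing each point with itself).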

df = pd.concat([df_identity_1, df_identity_2])

def euclidean(x1, y1, x2, y2):
    a = np.array((float(x1), float(y1)))
    b = np.array((float(x2), float(y2)))
    return np.linalg.norm(a-b)

# Function to find the closest coordinate with a different index
def find_closest_coord(row, df):
    d = df[(df['identity'] != int(row['identity']))]
    if d.empty:
        return None
    r = min(euclidean(row.x, row.y, r.x, r.y) for r in d.itertuples(index=False))
    return r

df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)
df
   index  identity         x         y  closest_coord
0      0         1  1.660468 -1.954339       1.258370
1      1         1  0.153419  0.619493       0.596869
2      2         1 -0.592440 -0.299046       0.138273
0      0         2 -0.431543  0.500864       0.596869
1      1         2 -0.607647 -0.436480       0.138273
2      2         2  1.613126 -0.696860       1.258370

Solution 1
1 Accepted 2019-12-22 20:45:40