Speed up nested looping over pandas DataFrame

I have a pandas.DataFrame containing many coordinates of objects, arranged into identity, x, and y columns.

I'm trying to find the closest objects across two identities. To clarify what I mean, take this code:

import numpy as np
import pandas as pd

# Generate random data
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(10000), 'y':np.random.randn(10000)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(10000), 'y':np.random.randn(10000)})
df = pd.concat([df_identity_1, df_identity_2])

>>> df
      identity         x         y
0            1 -1.784748  2.085517
1            1  0.324645 -1.584790
2            1 -0.044623 -0.348576
3            1  0.802035  1.362336
4            1 -0.091508 -0.655114
...        ...       ...       ...
9995         2  0.939491  0.304964
9996         2 -0.233707 -0.135265
9997         2  0.792494  1.157236
9998         2 -0.385080 -0.021226
9999         2  0.105970 -0.042135

Currently, I have to go through each row and iterate through the entire DataFrame again to find the closest coordinate.

# Function to find the absolute / Euclidean distance between two coordinates
def euclidean(x1, y1, x2, y2):
    a = np.array((int(x1), int(y1)))
    b = np.array((int(x2), int(y2)))
    return np.linalg.norm(a-b)

# Function to find the closest coordinate with a different identity
def find_closest_coord(row, df):
    d = df[(df['identity'] != int(row['identity']))]
    if d.empty:
        return None
    return min(euclidean(row.x, row.y, r.x, r.y) for r in df.itertuples(index=False))

df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)

This code is fully functional, but when I have a large dataset (100k+ coordinates) this "nested" for-loop is extremely time consuming.

Is there some functionality that could speed this up, or a faster approach altogether?

The best way to solve this problem is to use a spatial data structure. These data structures dramatically reduce the size of the search space for queries like this. SciPy provides a KD-tree for nearest-neighbor queries, but it would be a bit of a hassle to scale this to multiple machines (if the size of your data requires that).
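As a sketch of the KD-tree approach (the data and variable names here are illustrative, not from the original post), `scipy.spatial.cKDTree` can answer all of the nearest-neighbor queries in a single vectorized call:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Illustrative data: two small sets of 2-D points standing in
# for the identity-1 and identity-2 coordinates
df1 = pd.DataFrame({'x': [0.0, 1.0], 'y': [0.0, 1.0]})
df2 = pd.DataFrame({'x': [0.0], 'y': [1.0]})

# Build the tree on identity 2's coordinates once...
tree = cKDTree(df2[['x', 'y']].to_numpy())

# ...then query every identity-1 point in one call: k=1 returns
# the distance to, and the index of, the closest point in df2
dist, idx = tree.query(df1[['x', 'y']].to_numpy(), k=1)
df1['closest_coord'] = dist
```

The same pattern run in the other direction (tree on identity 1, query with identity 2) covers both halves of the DataFrame, and the whole thing is roughly O(n log n) instead of O(n²).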

If you need to scale beyond that, you'd probably want to use dedicated geospatial analytics tools.

In general, if you want to speed up something like this, you need to make tradeoffs between iterative approaches and memory intensity.

However, in this case, your core bottlenecks are:

  • Iterating row by row
  • Calling euclidean once per pair of rows, rather than once per dataset.

NumPy functions like norm are columnar in nature, and you should take advantage of that by calling them on the entire array of data. If each of your dataframes is 10,000 rows, you're calling norm 100 million times. Just tweaking the code to make that change should help you a lot.
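To make the columnar point concrete (a minimal sketch, not code from the post): one `norm` call with `axis=1` computes every row-wise distance at once, instead of one call per pair:

```python
import numpy as np

# Two paired sets of 2-D coordinates
a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 1.0], [0.0, 0.0]])

# A single call computes all row distances; axis=1 reduces over
# the coordinate dimension, leaving one distance per row
dist = np.linalg.norm(a - b, axis=1)
print(dist)  # [1. 5.]
```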

If you want to do this in Python at scale and can't use a spatial data structure effectively (and don't want to use heuristics to reduce the search space), something like the following would probably work: cross-product join the two tables, calculate the Euclidean distance once with a single columnar operation, and use a groupby-aggregation (min) to get the closest points.

This would be much faster and much more memory intensive than iterating row by row like you are doing, but could easily be scaled with something like Dask (or Spark).

I'm going to use only a few rows to illustrate the logic.

import numpy as np
import pandas as pd

# Generate random data
nrows = 3
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_1.reset_index(drop=False, inplace=True)
df_identity_2.reset_index(drop=False, inplace=True)

Notice how I'm creating a unique index in addition to the identity flag for each dataframe. This will come in handy later for the groupby. Next, I can do the cross-product join. This would be cleaner if we used different column names, but I'll keep it consistent with your example. This join will quickly run out of memory in pure pandas as the dataset grows, but Dask ( https://dask.org/ ) would be able to handle it quite well.

def cross_product(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')

crossprod = cross_product(df_identity_1, df_identity_2)
crossprod
   index_x  identity_x       x_x       y_x  index_y  identity_y       x_y       y_y
0        0           1  1.660468 -1.954339        0           2 -0.431543  0.500864
1        0           1  1.660468 -1.954339        1           2 -0.607647 -0.436480
2        0           1  1.660468 -1.954339        2           2  1.613126 -0.696860
3        1           1  0.153419  0.619493        0           2 -0.431543  0.500864
4        1           1  0.153419  0.619493        1           2 -0.607647 -0.436480
5        1           1  0.153419  0.619493        2           2  1.613126 -0.696860
6        2           1 -0.592440 -0.299046        0           2 -0.431543  0.500864
7        2           1 -0.592440 -0.299046        1           2 -0.607647 -0.436480
8        2           1 -0.592440 -0.299046        2           2  1.613126 -0.696860
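As a side note (not part of the original answer): on pandas 1.2+, the dummy-key trick in `cross_product` can be replaced by the native `merge(how='cross')`, which produces the same result:

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2]})
right = pd.DataFrame({'y': [10, 20]})

# Native cross join: every row of left paired with every row of right,
# no temporary 'key' column needed
crossprod = left.merge(right, how='cross')
print(len(crossprod))  # 4
```

When the two frames share column names (as in the answer's example), pandas applies the same `_x`/`_y` suffixes as the key-based merge, so the rest of the code is unchanged.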

Next, we just need to calculate the distance for each row, then group by index_x and index_y (respectively) and take the minimum distance value. Notice how we can do this with a single call to norm, rather than one call per row.

crossprod['dist'] = np.linalg.norm(crossprod[['x_x', 'y_x']].values - crossprod[['x_y', 'y_y']].values, axis=1)
closest_per_identity1 = crossprod.groupby(['index_x']).agg({'dist':'min'})
closest_per_identity2 = crossprod.groupby(['index_y']).agg({'dist':'min'})
closest_per_identity1
             dist
index_x
0        1.258370
1        0.596869
2        0.138273
closest_per_identity2
             dist
index_y
0        0.596869
1        0.138273
2        1.258370
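If you also need to know *which* point is closest, not just how far away it is, `idxmin` on the grouped distance column recovers the whole matching row. A small extension of the answer's approach, using a toy stand-in for the cross-product table (the column names match the example above):

```python
import pandas as pd

# Toy stand-in for crossprod: two left points, two right points
crossprod = pd.DataFrame({
    'index_x': [0, 0, 1, 1],
    'index_y': [0, 1, 0, 1],
    'dist':    [2.0, 1.0, 0.5, 3.0],
})

# idxmin returns the row label of the minimum distance per group;
# .loc then pulls those full rows, so the partner's index_y comes
# along with the minimum distance
nearest = crossprod.loc[crossprod.groupby('index_x')['dist'].idxmin()]
print(nearest[['index_x', 'index_y', 'dist']].to_string(index=False))
```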

Comparing to your original example on the same data: note that I changed your int calls to float (so the coordinates aren't truncated) and your itertuples to iterate through d rather than df (as otherwise you're comparing each point to itself).

df = pd.concat([df_identity_1, df_identity_2])

def euclidean(x1, y1, x2, y2):
    a = np.array((float(x1), float(y1)))
    b = np.array((float(x2), float(y2)))
    return np.linalg.norm(a-b)

# Function to find the closest coordinate with a different identity
def find_closest_coord(row, df):
    d = df[(df['identity'] != int(row['identity']))]
    if d.empty:
        return None
    r = min(euclidean(row.x, row.y, r.x, r.y) for r in d.itertuples(index=False))
    return r

df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)
df
index   identity    x   y   closest_coord
0   0   1   1.660468    -1.954339   1.258370
1   1   1   0.153419    0.619493    0.596869
2   2   1   -0.592440   -0.299046   0.138273
0   0   2   -0.431543   0.500864    0.596869
1   1   2   -0.607647   -0.436480   0.138273
2   2   2   1.613126    -0.696860   1.258370
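As a quick sanity check (my addition, not part of the original answer), the vectorized minimum distances can also be computed with plain NumPy broadcasting and compared against the row-by-row loop from the question:

```python
import numpy as np

a = np.array([[0.0, 0.0], [1.0, 1.0]])  # identity-1 points
b = np.array([[0.0, 2.0]])              # identity-2 points

# Row-by-row loop, as in the question's original approach
brute = np.array([min(np.linalg.norm(p - q) for q in b) for p in a])

# Broadcasting: (n,1,2) - (1,m,2) -> (n,m,2) pairwise differences,
# then one norm over the coordinate axis and a min over the m axis
vec = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).min(axis=1)

assert np.allclose(brute, vec)
```

Like the cross-product join, this materializes all n×m pairs in memory at once, so it trades memory for speed in exactly the way the answer describes.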
