Speeding up groupby.apply matching in pandas

Question

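I have a dataframe of points on a Cartesian plane, each with an ID (a color in this example), and a set of circles on the same plane, defined by their center positions. All circles have a radius of 2 units.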

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: points_df = pd.DataFrame([['green', 10., 10., 100],
               ['green', 5, 5, 200],
               ['blue', 9, 9, 3000 ],
               ['blue', 8, 8, 4000]], columns = ['color', 'x', 'y', 'height' ])

In [4]: points_df
   color     x     y  height
0  green  10.0  10.0     100
1  green   5.0   5.0     200
2   blue   9.0   9.0    3000
3   blue   8.0   8.0    4000

In [5]: circles = np.array([[10, 10], [5, 5], [9,9], [8,8]])

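For each circle, I want to find the entries in the points dataframe that fall inside that circle, separately for each color. If a color has more than one entry inside the circle, I want the one with the maximum 'height' value.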

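For simplicity, let's assume I have a point_selection function that takes a dataframe and one row of the circles array and performs this selection. I then apply this function to my dataframe like this: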

def point_selection(df, circle):
    #distance calculation and selection here
    return selected_df_row

groupby_color = points_df.groupby('color')
df_list = []

for circle in circles:
    selected = groupby_color.apply(point_selection, circle)
    # set_index(..., inplace=True) returns None, so append the returned frame instead
    df_list.append(selected.set_index('color'))

final_df = pd.concat(df_list)
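
To make the placeholder above concrete, here is a minimal sketch of what point_selection might look like. It is not the actual function from the question: the RADIUS constant, the radius parameter, and the choice to return a DataFrame with zero or one rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

RADIUS = 2.0  # assumed constant: every circle has a radius of 2 units


def point_selection(df, circle, radius=RADIUS):
    """Return the highest point of `df` that lies inside `circle`.

    The result is a DataFrame with one row, or with zero rows when no
    point of `df` falls inside the circle.
    """
    # Squared distance from every point to the circle center
    dist_sq = (df['x'] - circle[0]) ** 2 + (df['y'] - circle[1]) ** 2
    inside = df[dist_sq <= radius ** 2]
    # Keep only the row with the largest height (no rows if `inside` is empty)
    return inside.sort_values('height', ascending=False).head(1)


points_df = pd.DataFrame([['green', 10., 10., 100],
                          ['green', 5, 5, 200],
                          ['blue', 9, 9, 3000],
                          ['blue', 8, 8, 4000]],
                         columns=['color', 'x', 'y', 'height'])

# Circle centered at (9, 9), applied to each color group in turn
for color, group in points_df.groupby('color'):
    print(point_selection(group, np.array([9, 9])))
```

For the circle centered at (9, 9), this selects row 0 for green (the only green point within 2 units) and row 3 for blue (both blue points are inside, and row 3 has the larger height).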

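I am currently doing this on a dataframe with a large number of rows (~200000) and a large number of circles (~15000). Does anyone have a simple way to speed up these computations? groupby.apply is said to be rather slow, but I can't think of another way to do this.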

Answer 1

It seems you need:

def point_selection(df, circle):
    #distance calculation and selection here
    return pd.Series(selected_df_row)

# called for one circle at a time, e.g. inside the loop over circles
df = points_df.groupby('color').apply(point_selection, circle)
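
As a possible alternative that avoids calling groupby.apply once per circle, the distance test can be done for many circles at once with NumPy broadcasting, followed by a single groupby over (circle, color). The sketch below is only one way this might be done, not something proposed in the thread: the function name select_max_height, the chunk_size parameter, and the 'circle' column are hypothetical, and it assumes at least one circle is passed in.

```python
import numpy as np
import pandas as pd


def select_max_height(points_df, circles, radius=2.0, chunk_size=100):
    """For each (circle, color) pair, return the highest point inside the circle."""
    x = points_df['x'].to_numpy(dtype=float)
    y = points_df['y'].to_numpy(dtype=float)
    circles = np.asarray(circles, dtype=float)
    circle_ids, point_ids = [], []
    # Process circles in chunks so the distance matrix stays a manageable size
    for start in range(0, len(circles), chunk_size):
        chunk = circles[start:start + chunk_size]
        # Squared distances, shape (len(chunk), len(points_df))
        dist_sq = (chunk[:, 0, None] - x) ** 2 + (chunk[:, 1, None] - y) ** 2
        c, p = np.nonzero(dist_sq <= radius ** 2)
        circle_ids.append(c + start)
        point_ids.append(p)
    # One row per (circle, point) pair where the point lies inside the circle
    hits = points_df.iloc[np.concatenate(point_ids)].reset_index(drop=True)
    hits['circle'] = np.concatenate(circle_ids)
    # Keep the row with the largest height for every (circle, color) pair
    best = hits.loc[hits.groupby(['circle', 'color'])['height'].idxmax()]
    return best.set_index(['circle', 'color']).sort_index()


points_df = pd.DataFrame([['green', 10., 10., 100],
                          ['green', 5, 5, 200],
                          ['blue', 9, 9, 3000],
                          ['blue', 8, 8, 4000]],
                         columns=['color', 'x', 'y', 'height'])
circles = np.array([[10, 10], [5, 5], [9, 9], [8, 8]])

result = select_max_height(points_df, circles)
print(result)
```

With the default chunk_size of 100 and about 200000 points, each distance matrix holds roughly 20 million floats, so chunk_size would need tuning to the available memory. Whether this is actually faster than the loop in the question depends on the data and should be measured.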

Solution 1
1 2018-02-14 11:37:04