Efficient way to parallelize groupby and agg in pandas

Question

I want to parallelize the following code and speed up the groupby step:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'c', 'b', 'b'], 'B': ['e1', 'e1', 'e2', 'e3', 'e4', 'e2'], 'C':[[1,2,3], [4,1,5], [2,5,1], [6,2,6], [7,1,3], [7,5,8]]})
# Combine the lists in C for each (A, B) group.
df = df.groupby(['A', 'B'], as_index=False).agg({'C': sum})

I tried the following parallel function, but it did not reduce the time taken:

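from functools import partial
import multiprocessing as mp
import os

import pandas as pd

def applyParallel(dfGrouped, func, *args):
    # Send each group to a worker process; the pool is cleaned up on exit.
    with mp.Pool(os.cpu_count()) as p:
        result = p.map(partial(func, *args), [group for name, group in dfGrouped])
    return result

def aggregate_fun(data):
    # Each worker re-groups the single-group frame it receives.
    return data.groupby(['A', 'B'], as_index=False).agg({'C': sum})

if __name__ == '__main__':
    # The guard is needed on platforms that spawn worker processes.
    df1 = df.groupby(['A', 'B'], as_index=False)
    df2 = applyParallel(df1, aggregate_fun)
    df_grouped = pd.concat(df2, axis=0)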

How can I parallelize this, or otherwise reduce the time it takes? I have about 3 million rows, and it takes a long time.

Answer 1

You can reduce the groupby time by grouping on a single column, for example:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'c', 'b', 'b'], 'B': ['e1', 'e1', 'e2', 'e3', 'e4', 'e2'], 'C':[[1,2,3], [4,1,5], [2,5,1], [6,2,6], [7,1,3], [7,5,8]]})
df['new_col'] = df['A']+df['B']

df = df.groupby(['new_col'], as_index=False).agg({'C': sum})

Processing time = 2.6 ms instead of 3.5 ms for ['A', 'B'], and creating the new column is very cheap (0.25 ms).
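One caveat with a concatenated key: plain string concatenation can merge distinct pairs (for example 'ab' + 'c' and 'a' + 'bc' both give 'abc'). The sketch below is one way to guard against that and to re-measure the timings on your own machine. The '|' separator, the `flatten` helper, and the two wrapper functions are hypothetical names introduced for illustration, and `flatten` is swapped in for `sum` only to make the list concatenation explicit. It assumes '|' never occurs in column A or B.

```python
import timeit

import pandas as pd


def flatten(lists):
    """Concatenate the lists of one group into a single list."""
    return [item for sub in lists for item in sub]


df = pd.DataFrame({
    'A': ['a', 'a', 'b', 'c', 'b', 'b'],
    'B': ['e1', 'e1', 'e2', 'e3', 'e4', 'e2'],
    'C': [[1, 2, 3], [4, 1, 5], [2, 5, 1], [6, 2, 6], [7, 1, 3], [7, 5, 8]],
})

# Assumed separator: keeps ('ab', 'c') and ('a', 'bc') from colliding.
df['new_col'] = df['A'] + '|' + df['B']


def by_two_columns():
    # Baseline: group on both original columns.
    return df.groupby(['A', 'B'], as_index=False).agg({'C': flatten})


def by_single_key():
    # Variant from the answer: group on the single combined key.
    return df.groupby('new_col', as_index=False).agg({'C': flatten})


# For this sample, both groupings yield the same lists in the same order.
assert by_two_columns()['C'].tolist() == by_single_key()['C'].tolist()

# Timings vary with hardware, pandas version, and data size.
for fn in (by_two_columns, by_single_key):
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms per call")
```

On a six-row frame the difference is small either way, so it is worth repeating the measurement on a sample of the real 3-million-row data before committing to the extra column.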

Solution 1
0 2021-03-01 07:56:33
