Why is pandas.groupby.mean so much faster than a parallel implementation

Question

I am using the pandas groupby mean function, as shown below, on a very large dataset:

import pandas as pd
df=pd.read_csv("large_dataset.csv")
df.groupby(['variable']).mean()

It looks like this function does not use multiprocessing, so I implemented a parallel version:

import pandas as pd
from multiprocessing import Pool, cpu_count

def meanFunc(tmp_name, df_input):
    # Column means of one group, returned as a single-row DataFrame.
    df_res = df_input.mean().to_frame().transpose()
    return df_res

def applyParallel(dfGrouped, func):
    # One worker per core; every (name, group) pair is sent to a worker.
    num_process = int(cpu_count())
    with Pool(num_process) as p:
        ret_list = p.starmap(func, [[name, group] for name, group in dfGrouped])
    return pd.concat(ret_list)

applyParallel(df.groupby(['variable']), meanFunc)

However, the pandas implementation still seems to be far faster than my parallel implementation.

I was looking at the source code for pandas groupby, and I see that it uses cython. Is that the reason?

def _cython_agg_general(self, how, alt=None, numeric_only=True,
                        min_count=-1):
    output = {}
    for name, obj in self._iterate_slices():
        is_numeric = is_numeric_dtype(obj.dtype)
        if numeric_only and not is_numeric:
            continue

        try:
            result, names = self.grouper.aggregate(obj.values, how,
                                                   min_count=min_count)
        except AssertionError as e:
            raise GroupByError(str(e))
        output[name] = self._try_cast(result, obj)

    if len(output) == 0:
        raise DataError('No numeric types to aggregate')

    return self._wrap_aggregated_output(output, names)

Answer 1

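Short answer: if you want parallelism for these kinds of cases, use dask. Your approach has pitfalls that dask avoids. It still may not be faster, but it will give you your best shot, and it is largely a drop-in replacement for pandas.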

Longer answer

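1) Parallelism inherently adds overhead, so ideally the operation you parallelize is somewhat expensive. Adding up numbers is not especially expensive. You are right that cython is used here; the code you are looking at is the dispatch logic. The actual core cython code translates into a very simple C loop.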
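To make the "very simple loop" concrete, here is a rough pure-Python sketch of a single-pass grouped mean. This is not pandas' actual cython code; `grouped_mean` and the toy data are hypothetical, and the sketch only shows the general shape: one pass over the data, keeping a running sum and count per group. Compiled to C, a loop like this does so little work per element that process startup and data transfer can easily cost more than the computation itself.

```python
from collections import defaultdict

def grouped_mean(labels, values):
    """Mean of `values` per label, computed in one pass over the data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        sums[label] += value      # running total for this group
        counts[label] += 1        # number of rows seen for this group
    return {label: sums[label] / counts[label] for label in sums}

# Toy data (illustrative only).
labels = ["a", "b", "a", "b", "c"]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
result = grouped_mean(labels, values)
print(result)
```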

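2) You are using multiprocessing, which means each process needs to get a copy of the data. That is expensive. Normally you have to do this in Python because of the GIL. Here you actually can (and dask does) use threads, because the pandas operations run in C and release the GIL.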
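A small sketch of where the copy comes from: multiprocessing pickles the arguments it sends to a worker, and the worker rebuilds its own object from those bytes. The example below uses a plain `array` of one million doubles as a stand-in for one group's numeric column (an assumption for illustration; a real DataFrame carries more overhead than this).

```python
import pickle
from array import array

# Stand-in for one group's numeric column: one million float64 values (~8 MB).
values = array("d", range(1_000_000))

# multiprocessing pickles arguments before handing them to a worker process.
payload = pickle.dumps(values)
received = pickle.loads(payload)  # what the worker would reconstruct

print(f"bytes serialized: {len(payload):,}")
print("worker sees its own copy:", received is not values)
```

Every group passed through `starmap` pays this serialize-and-rebuild cost, and the result pays it again on the way back. Threads share memory, so they skip it.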

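3) As @AKX pointed out in the comments, the iteration that happens before the parallel step (... name, group in dfGrouped) is also relatively expensive: it constructs a new sub-DataFrame for each group. The original pandas algorithm iterates over the data in place.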
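The difference between the two paths can be seen on a toy frame. The DataFrame below and its columns `x` and `y` are made up for illustration. Both paths give the same means, but the manual one materializes a separate DataFrame per group before any arithmetic happens.

```python
import pandas as pd

# Toy data (illustrative only).
df = pd.DataFrame({
    "variable": ["a", "b", "a", "b", "c"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Built-in path: one call, the aggregation happens inside pandas.
builtin = df.groupby("variable").mean()

# Manual path: iterating yields a new sub-DataFrame for every group.
pieces = {
    name: group[["x", "y"]].mean()
    for name, group in df.groupby("variable")
}
manual = pd.DataFrame(pieces).T  # rows: groups, columns: x and y

print(builtin)
print(manual)
```

With many small groups, building those intermediate frames can dominate the runtime before any worker has started.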

3 · accepted · 2019-02-04 22:00:44
