
Why is pandas.groupby.mean so much faster than a parallelized implementation?

I was using the pandas groupby mean function as follows on a very large dataset:

import pandas as pd

df = pd.read_csv("large_dataset.csv")
df.groupby(['variable']).mean()

It looks like the function does not use multiprocessing, so I implemented a parallelized version:

import pandas as pd
from multiprocessing import Pool, cpu_count

def meanFunc(tmp_name, df_input):
    # Compute the column means of one group and return them as a one-row frame.
    df_res = df_input.mean().to_frame().transpose()
    return df_res

def applyParallel(dfGrouped, func):
    num_process = int(cpu_count())
    with Pool(num_process) as p:
        # Each (name, group) pair is pickled and sent to a worker process.
        ret_list = p.starmap(func, [[name, group] for name, group in dfGrouped])
    return pd.concat(ret_list)

applyParallel(df.groupby(['variable']), meanFunc)

However, the built-in pandas implementation is still much faster than my parallel implementation.

Looking at the source code for pandas groupby, I see that it uses Cython. Is that the reason?

def _cython_agg_general(self, how, alt=None, numeric_only=True,
                        min_count=-1):
    output = {}
    for name, obj in self._iterate_slices():
        is_numeric = is_numeric_dtype(obj.dtype)
        if numeric_only and not is_numeric:
            continue

        try:
            result, names = self.grouper.aggregate(obj.values, how,
                                                   min_count=min_count)
        except AssertionError as e:
            raise GroupByError(str(e))
        output[name] = self._try_cast(result, obj)

    if len(output) == 0:
        raise DataError('No numeric types to aggregate')

    return self._wrap_aggregated_output(output, names)

Short answer - use dask if you want parallelism for these types of cases. Your approach has pitfalls that it avoids. It still might not be faster, but it will give you the best shot, and it is a largely drop-in replacement for pandas.
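For a sense of what "drop-in" means here, a minimal sketch of the Dask version, assuming the same `large_dataset.csv` and `variable` column from the question (not a benchmarked claim that it will be faster on your data):

import dask.dataframe as dd

# Lazy, partitioned read - nothing is loaded until compute() is called.
ddf = dd.read_csv("large_dataset.csv")

# Dask aggregates each partition in parallel, then combines the results.
result = ddf.groupby("variable").mean().compute()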

Longer answer

1) Parallelism inherently adds overhead, so ideally the operation you're parallelizing should be somewhat expensive. Adding up numbers isn't especially costly. You're right that Cython is used here, but the code you're looking at is dispatch logic; the actual core Cython aggregation routine translates down to a very simple C loop.
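For intuition, here is a hedged pure-NumPy sketch of the kind of single-pass loop that routine compiles down to. This is not the actual pandas source; the function and variable names are illustrative:

import numpy as np

def group_mean_sketch(values, labels, n_groups):
    # One pass over the data: accumulate a running sum and count per group.
    # 'labels' holds an integer group id for each row.
    sums = np.zeros(n_groups)
    counts = np.zeros(n_groups)
    for i in range(len(values)):
        g = labels[i]
        sums[g] += values[i]
        counts[g] += 1
    return sums / counts

Per element, that is one index lookup and two additions - there is very little work for parallelism to amortize.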

2) You're using multiprocessing, which means that each process needs to take a copy of the data. This is expensive. Normally you have to do this in Python because of the GIL, but here you actually can (and dask does) use threads, because the pandas operation runs in C and releases the GIL.
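To illustrate the difference, a hedged thread-based variant of the question's `applyParallel` (a sketch only; it avoids the per-process copies but still pays the group-iteration cost described in point 3 below):

from concurrent.futures import ThreadPoolExecutor
import os
import pandas as pd

def applyParallelThreads(dfGrouped, func):
    # Threads share memory, so no per-worker copy of each group is needed;
    # since the underlying C aggregation releases the GIL, threads can
    # actually run the mean computations concurrently.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as ex:
        ret_list = list(ex.map(lambda ng: func(*ng), dfGrouped))
    return pd.concat(ret_list)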

3) As @AKX noted in the comments, the iteration before you parallelize ( ... name, group in dfGrouped ) is also relatively expensive: it constructs a new sub-DataFrame for each group. The original pandas algorithm iterates over the data in place.
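That cost is easy to measure directly. A rough sketch, reusing the question's file and column name (actual timings will vary with your data):

import time
import pandas as pd

df = pd.read_csv("large_dataset.csv")

# Cost of just materializing each group as its own DataFrame -
# this happens before any worker process ever runs.
t0 = time.perf_counter()
groups = list(df.groupby(['variable']))
print("iterate groups:", time.perf_counter() - t0)

# Cost of the full built-in aggregation, for comparison.
t0 = time.perf_counter()
df.groupby(['variable']).mean()
print("built-in mean: ", time.perf_counter() - t0)

On many datasets the iteration alone takes longer than the entire built-in groupby-mean, so the parallel version has lost before any worker starts.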
