How do I speed up applying a function to a large pandas dataframe?
So I started yesterday on applying a function to a decent-size dataset (6 million rows), but it's taking forever. I'm even trying to use pandarallel, but that isn't working well either. In any case, here is the code that I'm using:
```python
import numpy as np

def classifyForecast(dataframe):
    # Number of periods with non-zero demand
    buckets = len(dataframe[dataframe['QUANTITY'] != 0])
    try:
        adi = dataframe.shape[0] / buckets  # average demand interval
        cov = dataframe['QUANTITY'].std() / dataframe['QUANTITY'].mean()
        if adi < 1.32:
            if cov < .49:
                dataframe['TYPE'] = 'Smooth'
            else:
                dataframe['TYPE'] = 'Erratic'
        else:
            if cov < .49:
                dataframe['TYPE'] = 'Intermittent'
            else:
                dataframe['TYPE'] = 'Lumpy'
    except ZeroDivisionError:  # buckets == 0: no non-zero demand
        dataframe['TYPE'] = 'Smooth'

    try:
        dataframe['ADI'] = adi
    except NameError:  # adi was never assigned
        dataframe['ADI'] = np.inf
    try:
        dataframe['COV'] = cov
    except NameError:  # cov was never assigned
        dataframe['COV'] = np.inf
    return dataframe
```
```python
from pandarallel import pandarallel

pandarallel.initialize()

def quick_classification(df):
    # Pass the function itself, not its result: the original
    # df.parallel_apply(classifyForecast(df)) called classifyForecast
    # eagerly and handed its return value to parallel_apply.
    return df.parallel_apply(classifyForecast)
```
Also, please note that I am splitting the dataframe up into batches. I don't want the function to work on each row; instead, I want it to work on the chunks, so that I can get the `.mean()` and `.std()` of specific columns. It shouldn't take 48 hours to complete. How do I speed this up?
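Depending on how the batches are formed, one alternative worth sketching is to drop the per-chunk apply entirely and compute ADI and COV for every group in a single vectorized `groupby` pass. The `ITEM` grouping column and the sample data below are assumptions for illustration, standing in for however the real dataframe is split into batches:

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'ITEM' is an assumed grouping column.
df = pd.DataFrame({
    'ITEM': ['A'] * 4 + ['B'] * 4,
    'QUANTITY': [1, 0, 2, 3, 10, 11, 10, 9],
})

g = df.groupby('ITEM')['QUANTITY']
# Count of periods with non-zero demand, per group
nonzero = df.loc[df['QUANTITY'] != 0].groupby('ITEM')['QUANTITY'].size()

stats = pd.DataFrame({
    'ADI': g.size() / nonzero,  # average demand interval per group
    'COV': g.std() / g.mean(),  # coefficient of variation per group
})

# Same thresholds as classifyForecast, but applied to all groups at once
stats['TYPE'] = np.where(
    stats['ADI'] < 1.32,
    np.where(stats['COV'] < 0.49, 'Smooth', 'Erratic'),
    np.where(stats['COV'] < 0.49, 'Intermittent', 'Lumpy'),
)

# Broadcast the per-group results back onto the original rows
df = df.merge(stats, left_on='ITEM', right_index=True)
```

This replaces millions of Python-level function calls with a handful of C-level aggregations, which is usually a far bigger win than parallelizing the apply. A group whose demand is all zeros would get a NaN ADI here rather than the original's `Smooth`/`inf` fallback, so that edge case would need separate handling.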
It looks like `mean` and `std` are the only calculations here, so I'm guessing that this is the bottleneck. You could try speeding it up with numba.
```python
import numpy as np
from numba import njit

@njit(parallel=True)
def numba_mean(x):
    return np.mean(x)

@njit(parallel=True)
def numba_std(x):
    return np.std(x)

# Note: np.std computes the population std (ddof=0), while pandas'
# Series.std defaults to the sample std (ddof=1), so the results
# will differ slightly from the original code.
cov = numba_std(dataframe['QUANTITY'].values) / numba_mean(dataframe['QUANTITY'].values)
```