How do I speed up applying a function to a large pandas dataframe?

So I started yesterday on applying a function to a decent-sized dataset (6 million rows), but it's taking forever. I'm even trying to use pandarallel, but that isn't working well either. In any case, here is the code that I'm using:

import numpy as np

def classifyForecast(dataframe):
    # Number of rows (periods) with non-zero demand.
    buckets = (dataframe['QUANTITY'] != 0).sum()

    if buckets == 0:
        # No demand at all: ADI and COV are undefined, so fall back to defaults.
        adi = np.inf
        cov = np.inf
        demand_type = 'Smooth'
    else:
        # Average demand interval and coefficient of variation.
        adi = dataframe.shape[0] / buckets
        cov = dataframe['QUANTITY'].std() / dataframe['QUANTITY'].mean()
        if adi < 1.32:
            demand_type = 'Smooth' if cov < .49 else 'Erratic'
        else:
            demand_type = 'Intermittent' if cov < .49 else 'Lumpy'

    dataframe['TYPE'] = demand_type
    dataframe['ADI'] = adi
    dataframe['COV'] = cov
    return dataframe

from pandarallel import pandarallel

pandarallel.initialize()

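def quick_classification(df):
    # Note: as written, classifyForecast(df) runs once on the whole frame and
    # its result (a DataFrame, not a function) is what parallel_apply receives.
    return df.parallel_apply(classifyForecast(df))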

Also, please note that I am splitting the dataframe up into batches. I don't want the function to work on each row; instead, I want it to work on the chunks. That way I can get the .mean() and .std() of specific columns.

It shouldn't take 48 hours to complete. How do I speed this up?

It looks like mean and std are the only calculations here, so I'm guessing that this is the bottleneck.
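One way to check that guess before optimizing is to time the statistics on their own. This is only a minimal sketch: the data is synthetic (the real batches are not shown in the question) and `time_call`, `pandas_cov`, and `numpy_cov` are made-up helper names.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for one batch; an assumption, not the asker's data.
rng = np.random.default_rng(0)
batch = pd.DataFrame({'QUANTITY': rng.poisson(2.0, size=100_000)})

def time_call(func, repeats=20):
    """Return the best wall-clock time in seconds over several runs (hypothetical helper)."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

def pandas_cov():
    q = batch['QUANTITY']
    return q.std() / q.mean()

def numpy_cov():
    q = batch['QUANTITY'].to_numpy()
    # ddof=1 matches the sample standard deviation that pandas uses by default.
    return q.std(ddof=1) / q.mean()

print('pandas:', time_call(pandas_cov))
print('numpy: ', time_call(numpy_cov))
```

If both timings turn out to be small for a single batch, the total runtime is more likely dominated by how many times the function is called than by mean and std themselves.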

You could try speeding it up with numba.

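from numba import njit
import numpy as np

# parallel=True asks numba to parallelize supported array operations.
@njit(parallel=True)
def numba_mean(x):
    return np.mean(x)

# Note: np.std is the population standard deviation (ddof=0), whereas
# pandas' .std() defaults to the sample standard deviation (ddof=1).
@njit(parallel=True)
def numba_std(x):
    return np.std(x)

# Inside classifyForecast, replace the pandas calls with:
quantity = dataframe['QUANTITY'].values
cov = numba_std(quantity) / numba_mean(quantity)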
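Another option, separate from numba, is to avoid calling a Python function once per batch and instead compute the statistics for every batch in a single groupby pass. The following is a minimal sketch, not the asker's method: it assumes the batches are identified by a hypothetical `ITEM` column (the question does not say how the dataframe is split), and `classify_all` is a made-up name. The thresholds and the fallback values for all-zero batches are taken from the question's code.

```python
import numpy as np
import pandas as pd

def classify_all(df, key='ITEM'):
    """Classify every batch in one pass; `key` is a hypothetical batch column."""
    grouped = df.groupby(key)['QUANTITY']
    rows = grouped.size()
    # Count of non-zero rows per batch, without a per-group Python callback.
    buckets = df['QUANTITY'].ne(0).groupby(df[key]).sum()

    empty = buckets == 0
    adi = rows / buckets  # pandas yields inf where buckets == 0
    # Sample std (ddof=1) over mean, as in the question; inf for all-zero batches.
    cov = (grouped.std() / grouped.mean()).where(~empty, np.inf)

    # Conditions are checked in order; the first match wins.
    types = np.select(
        [empty, (adi < 1.32) & (cov < .49), adi < 1.32, cov < .49],
        ['Smooth', 'Smooth', 'Erratic', 'Intermittent'],
        default='Lumpy',
    )
    stats = pd.DataFrame({
        'TYPE': pd.Series(types, index=adi.index),
        'ADI': adi,
        'COV': cov,
    })
    # Broadcast the per-batch results back onto the original rows.
    return df.join(stats, on=key)

# Small made-up example with three batches.
example = pd.DataFrame({
    'ITEM': ['a'] * 4 + ['b'] * 4 + ['c'] * 2,
    'QUANTITY': [5, 5, 5, 5, 0, 0, 0, 10, 0, 0],
})
result = classify_all(example)
print(result)
```

Because the grouping, counting, mean, and std all run inside pandas rather than in a per-batch Python function, this tends to scale better to many small batches, though the actual gain depends on the data.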
