How do I speed up applying a function to a large pandas dataframe?
So I started yesterday on applying a function to a decent-size dataset (6 million rows), but it's taking forever. I'm even trying to use pandarallel, but that isn't working well either. In any case, here is the code that I'm using:
```python
import numpy as np

def classifyForecast(dataframe):
    # Number of periods with non-zero demand
    buckets = len(dataframe[dataframe['QUANTITY'] != 0])
    try:
        adi = dataframe.shape[0] / buckets  # average demand interval
        cov = dataframe['QUANTITY'].std() / dataframe['QUANTITY'].mean()
        if adi < 1.32:
            if cov < .49:
                dataframe['TYPE'] = 'Smooth'
            else:
                dataframe['TYPE'] = 'Erratic'
        else:
            if cov < .49:
                dataframe['TYPE'] = 'Intermittent'
            else:
                dataframe['TYPE'] = 'Lumpy'
    except ZeroDivisionError:  # buckets == 0: no non-zero demand
        dataframe['TYPE'] = 'Smooth'

    try:
        dataframe['ADI'] = adi
    except NameError:  # adi was never assigned
        dataframe['ADI'] = np.inf
    try:
        dataframe['COV'] = cov
    except NameError:  # cov was never assigned
        dataframe['COV'] = np.inf
    return dataframe
```
```python
from pandarallel import pandarallel

pandarallel.initialize()

def quick_classification(df):
    # Pass the function itself, not its result: the original
    # df.parallel_apply(classifyForecast(df)) called classifyForecast
    # eagerly and handed its return value to parallel_apply.
    return df.parallel_apply(classifyForecast)
```
Also, please note that I am splitting the dataframe up into batches. I don't want the function to work on each row; instead, I want it to work on the chunks, so that I can get the `.mean()` and `.std()` of specific columns. It shouldn't take 48 hours to complete. How do I speed this up?
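Depending on how the batches are formed, one alternative worth sketching is to drop the per-chunk apply entirely and compute ADI and COV for every group in a single vectorized `groupby` pass. The `ITEM` grouping column and the sample data below are assumptions for illustration, standing in for however the real dataframe is split into batches:

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'ITEM' is an assumed grouping column.
df = pd.DataFrame({
    'ITEM': ['A'] * 4 + ['B'] * 4,
    'QUANTITY': [1, 0, 2, 3, 10, 11, 10, 9],
})

g = df.groupby('ITEM')['QUANTITY']
# Count of periods with non-zero demand, per group
nonzero = df.loc[df['QUANTITY'] != 0].groupby('ITEM')['QUANTITY'].size()

stats = pd.DataFrame({
    'ADI': g.size() / nonzero,  # average demand interval per group
    'COV': g.std() / g.mean(),  # coefficient of variation per group
})

# Same thresholds as classifyForecast, but applied to all groups at once
stats['TYPE'] = np.where(
    stats['ADI'] < 1.32,
    np.where(stats['COV'] < 0.49, 'Smooth', 'Erratic'),
    np.where(stats['COV'] < 0.49, 'Intermittent', 'Lumpy'),
)

# Broadcast the per-group results back onto the original rows
df = df.merge(stats, left_on='ITEM', right_index=True)
```

This replaces millions of Python-level function calls with a handful of C-level aggregations, which is usually a far bigger win than parallelizing the apply. A group whose demand is all zeros would get a NaN ADI here rather than the original's `Smooth`/`inf` fallback, so that edge case would need separate handling.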
It looks like `mean` and `std` are the only calculations here, so I'm guessing that this is the bottleneck. You could try speeding it up with numba.
```python
import numpy as np
from numba import njit

@njit(parallel=True)
def numba_mean(x):
    return np.mean(x)

@njit(parallel=True)
def numba_std(x):
    return np.std(x)

# Note: np.std computes the population std (ddof=0), while pandas'
# Series.std defaults to the sample std (ddof=1), so the results
# will differ slightly from the original code.
cov = numba_std(dataframe['QUANTITY'].values) / numba_mean(dataframe['QUANTITY'].values)
```