
Pandas groupby with a custom function for a large dataset and a large number of groups

I wrote a function that calculates linear regression values using scipy.stats.linregress. When I apply it to my dataset, the code runs for a long time, about 30 minutes. Is there any way to speed this up? The dataset contains about 9 million rows and about 100 thousand groups, and the function has to be applied to 10 columns.

import numpy as np
from scipy.stats import linregress

def linear_trend_timewise(x):
    """
    :param x: the time series to calculate the feature of. The index must be datetime.
    :type x: pandas.Series
    :param param: contains dictionaries {"attr": x} with x a string, the attribute name of the regression model
    :type param: list
    :return: the different feature values
    :return type: list
    """
    ix = x.index.get_level_values(1)

    # Get differences between each timestamp and the first timestamp in seconds.
    # Then convert to mins and reshape for linear regression
    times_seconds = (ix - ix[0]).total_seconds()
    times_mins = np.asarray(times_seconds / float(60))

    if np.all(np.isnan(x)):
        x = x
    else:
        x = x.interpolate(method='linear').values
        times_mins, x = times_mins[~np.isnan(x)], x[~np.isnan(x)]
    
    linReg = linregress(times_mins, x)

    return [getattr(linReg, config["attr"]) for config in param]

Applying the function:

agged = feature_df.groupby(['group'])[cols].agg(linear_trend_timewise)

The general approach is to remove lines that do no useful work.

x = x does not do anything, so it is better to write:

# `not` is safer than `~` here: on a plain Python bool, ~True is -2, which is truthy
if not np.all(np.isnan(x)):
    x = x.interpolate(method='linear').values
    times_mins, x = times_mins[~np.isnan(x)], x[~np.isnan(x)]

This reduced the time a little (on my small test data, from 47 ms to 40 ms using %timeit).

Then the question is whether you really want to fill the NaN values by interpolation before fitting the regression line. If not, use x = x[~np.isnan(x)] at the beginning and skip the interpolation as well.
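
To illustrate dropping the NaN values up front instead of interpolating, here is a minimal sketch. It assumes a Series with a plain DatetimeIndex (the question uses a MultiIndex) and returns only the slope; the function name slope_without_interpolation is made up for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

def slope_without_interpolation(x):
    """Slope of x against elapsed minutes, using only the non-NaN points.

    Assumption: x is a pandas.Series indexed by a DatetimeIndex.
    """
    x = x[~np.isnan(x)]              # drop missing values, no interpolation
    if len(x) < 2:                   # a line needs at least two points
        return np.nan
    # Elapsed time since the first remaining timestamp, in minutes
    minutes = np.asarray((x.index - x.index[0]).total_seconds()) / 60.0
    return linregress(minutes, x.to_numpy()).slope
```

Note that this changes the result whenever gaps exist, because interpolated points no longer take part in the fit.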

Because I don't know what is in param, I can't be specific about the last line, but to skip the for loop there you could use linReg.__dict__.values() and select the values you need later.
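
Another way to avoid rebuilding the attribute lookup for every group, sketched here under the assumption that param is a fixed list of {"attr": name} dictionaries (the three names below are stand-ins, not taken from the question), is to build one operator.attrgetter outside the function and reuse it:

```python
from operator import attrgetter

import numpy as np
from scipy.stats import linregress

# Hypothetical stand-in for `param`; the real contents are not shown in the question.
param = [{"attr": "slope"}, {"attr": "intercept"}, {"attr": "rvalue"}]

# Build the getter once, outside the per-group function.
# With two or more names it returns a tuple in the same order as `param`.
get_attrs = attrgetter(*(config["attr"] for config in param))

result = linregress(np.array([0.0, 1.0, 2.0, 3.0]), np.array([1.0, 3.0, 5.0, 7.0]))
slope, intercept, rvalue = get_attrs(result)
```

This is likely only a small saving compared with the cost of the regression itself; with a single name, attrgetter returns the bare value rather than a tuple.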

Groupby processes the groups one at a time, so running them in parallel also helps.

from multiprocessing import Pool, cpu_count
import pandas as pd

def applyParallel(dfGrouped, func):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)
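
A possible way to use such a helper is sketched below. The column names group, time and value, and the worker group_slope, are assumptions for illustration. The worker has to be a top-level function so it can be pickled, and it has to return something pd.concat accepts. Here the worker carries the group key in its output, since the helper discards the group names.

```python
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd
from scipy.stats import linregress

def group_slope(group):
    """Per-group worker: one-row DataFrame with the slope of `value` over minutes."""
    g = group.dropna(subset=["value"])
    if len(g) < 2:
        slope = np.nan
    else:
        minutes = (g["time"] - g["time"].iloc[0]).dt.total_seconds() / 60.0
        slope = linregress(minutes.to_numpy(), g["value"].to_numpy()).slope
    # Keep the group key in the result so rows can be identified after concat
    return pd.DataFrame({"group": [group["group"].iloc[0]], "slope": [slope]})

def apply_parallel(df_grouped, func):
    """Same idea as applyParallel above: map func over the groups in a process pool."""
    with Pool(cpu_count()) as p:
        parts = p.map(func, [group for _, group in df_grouped])
    return pd.concat(parts, ignore_index=True)

if __name__ == "__main__":  # needed on platforms that spawn worker processes
    times = pd.date_range("2024-01-01", periods=4, freq="min")
    df = pd.DataFrame({
        "group": ["a"] * 4 + ["b"] * 4,
        "time": list(times) * 2,
        "value": [1.0, 2.0, 3.0, 4.0, 8.0, 6.0, 4.0, 2.0],
    })
    print(apply_parallel(df.groupby("group"), group_slope))
```

With around 100 thousand small groups, sending each group to a worker has its own cost, so it is worth timing this against the serial version on a sample first.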

