Pandas groupby with a custom function for a large dataset and a large number of groups

I wrote a function to calculate linear regression statistics based on scipy.stats.linregress. But when I apply it to my dataset, the code runs for a long time, about 30 minutes. Is there any way to speed up the process? The dataset contains about 9 million rows and about 100 thousand groups, and the function has to be applied to 10 columns.

import numpy as np
from scipy.stats import linregress

def linear_trend_timewise(x):
    """
    :param x: the time series to calculate the feature of. The index must be datetime.
    :type x: pandas.Series
    :param param: contains dictionaries {"attr": x} with x a string, the attribute name of the regression model
    :type param: list
    :return: the different feature values
    :return type: list
    """
    ix = x.index.get_level_values(1)

    # Get differences between each timestamp and the first timestamp in seconds,
    # then convert to minutes for the linear regression.
    times_seconds = (ix - ix[0]).total_seconds()
    times_mins = np.asarray(times_seconds / float(60))

    if np.all(np.isnan(x)):
        x = x
    else:
        x = x.interpolate(method='linear').values
        times_mins, x = times_mins[~np.isnan(x)], x[~np.isnan(x)]

    linReg = linregress(times_mins, x)

    # param is defined in the enclosing scope, e.g. param = [{"attr": "slope"}]
    return [getattr(linReg, config["attr"]) for config in param]

Applying the function

agged = feature_df.groupby(['group'])[cols].agg(linear_trend_timewise)

The general approach is to skip unnecessary work.

The branch x = x does nothing, so it is better to write:

if not np.all(np.isnan(x)):
    x = x.interpolate(method='linear').values
    times_mins, x = times_mins[~np.isnan(x)], x[~np.isnan(x)]

This reduced the time a little (on my small test data, from 47 ms to 40 ms with %timeit).

Then the question is whether you really want to fill the NaNs with interpolated values before fitting the regression line. If not, use x = x[~np.isnan(x)] at the beginning and skip the interpolation entirely, for example:
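
A minimal sketch of that variant (the function name is hypothetical; it mirrors the original apart from the NaN handling):

import numpy as np
from scipy.stats import linregress

def linear_trend_timewise_dropna(x):
    """Drop NaNs up front instead of interpolating (hypothetical variant)."""
    ix = x.index.get_level_values(1)
    times_mins = np.asarray((ix - ix[0]).total_seconds() / 60.0)

    # Keep only the non-NaN observations; no interpolation pass is needed.
    mask = ~np.isnan(x.values)
    # Caveat: an all-NaN group leaves empty arrays, which linregress rejects,
    # so guard for that case if it can occur in your data.
    return linregress(times_mins[mask], x.values[mask])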

Because I don't know what is in param, to skip the for loop in the last line you could use linReg._asdict().values() (linregress returns a named tuple; on older Python versions linReg.__dict__.values() also worked) and then select the values you need.
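
For example, since the result is a named tuple, you can also unpack it once and avoid getattr altogether (a sketch, assuming param asks for slope and pvalue):

# linregress returns a named tuple (slope, intercept, rvalue, pvalue, stderr),
# so it can be unpacked directly instead of looping with getattr.
slope, intercept, rvalue, pvalue, stderr = linregress(times_mins, x)

# Assumption: param requested "slope" and "pvalue".
result = [slope, pvalue]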

A groupby executes the function on the groups one by one, so running the groups in parallel also helps:

from multiprocessing import Pool, cpu_count
import pandas as pd

def applyParallel(dfGrouped, func):
    # Run func on each group in a separate worker process,
    # then stitch the per-group results back together.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)
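
A sketch of how this might be wired up with the question's data (group_features is a hypothetical wrapper; cols and feature_df come from the question):

# Hypothetical wrapper: apply the feature function to each column of one group.
def group_features(group):
    return group[cols].apply(linear_trend_timewise)

if __name__ == '__main__':  # guard needed for multiprocessing on spawn platforms
    agged = applyParallel(feature_df.groupby('group'), group_features)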
