
Speed up rolling window in Pandas

I have this code, which works fine and gives me the result I am looking for. It loops through a list of window sizes to create rolling aggregates for each metric in sum_metrics_list, min_metrics_list and max_metrics_list.

# create the rolling aggregations for each window
for window in constants.AGGREGATION_WINDOW:
    # get the rolling sums
    sum_metrics_names_list = [x[6:] + "_1_" + str(window) for x in sum_metrics_list]
    adt_df[sum_metrics_names_list] = adt_df.groupby('athlete_id')[sum_metrics_list].apply(lambda x: x.rolling(center=False, window=window, min_periods=1).sum())

    # get the min of mins
    min_metrics_names_list = [x[6:] + "_1_" + str(window) for x in min_metrics_list]
    adt_df[min_metrics_names_list] = adt_df.groupby('athlete_id')[min_metrics_list].apply(lambda x: x.rolling(center=False, window=window, min_periods=1).min())

    # get the max of maxes
    max_metrics_names_list = [x[6:] + "_1_" + str(window) for x in max_metrics_list]
    adt_df[max_metrics_names_list] = adt_df.groupby('athlete_id')[max_metrics_list].apply(lambda x: x.rolling(center=False, window=window, min_periods=1).max())

It works well on small datasets, but as soon as I run it on my full data with >3000 metrics and 40 windows, it becomes very slow. Is there any way to optimise this code?

The benchmark (and code) below suggests that you can save a significant amount of time by using

df.groupby(...).rolling() 

instead of

df.groupby(...)[col].apply(lambda x: x.rolling(...))

The main time-saving idea here is to apply vectorized functions (such as sum) to the largest possible array (or DataFrame) at once, with one function call, instead of making many tiny function calls.
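As a toy illustration of that heuristic (plain NumPy, nothing to do with the question's data): summing one large array in a single vectorized call is much faster than computing the same total through a thousand small calls.

import timeit
import numpy as np

a = np.random.rand(10**6)

# one vectorized call over the whole array
print(timeit.timeit(lambda: a.sum(), number=100))

# the same total computed through 1000 tiny calls
print(timeit.timeit(lambda: sum(chunk.sum() for chunk in np.split(a, 1000)),
                    number=100))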

df.groupby(...).rolling().sum() calls sum on each (grouped) sub-DataFrame, so it can compute the rolling sums for all the columns with one call. For example, you could use df[sum_metrics_list + [key]].groupby(key).rolling().sum() to compute the rolling sums over all the sum_metrics_list columns at once.

In contrast, df.groupby(...)[col].apply(lambda x: x.rolling(...)) calls sum on a single column of each (grouped) sub-DataFrame. Since you have >3000 metrics, you end up calling df.groupby(...)[col].rolling().sum() (or min or max) over 3000 times for every window.
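Applied to the question's loop, that idea looks roughly like the sketch below. Treat it as a sketch, not a drop-in fix: it assumes adt_df, the three metric lists and constants.AGGREGATION_WINDOW from the question, that adt_df has a unique index (the assignment aligns on it), and that your pandas version puts the group key in the index of a groupby/rolling result (hence the reset_index).

# one groupby/rolling call per (window, aggregation) pair
# instead of one call per column
for window in constants.AGGREGATION_WINDOW:
    grouped = adt_df.groupby('athlete_id')
    for metrics, agg in [(sum_metrics_list, 'sum'),
                         (min_metrics_list, 'min'),
                         (max_metrics_list, 'max')]:
        names = [x[6:] + "_1_" + str(window) for x in metrics]
        rolled = grouped[metrics].rolling(
            center=False, window=window, min_periods=1).agg(agg)
        # groupby/rolling typically prepends athlete_id to the index;
        # drop that level so the result aligns with adt_df again
        rolled = rolled.reset_index(level=0, drop=True)
        rolled.columns = names
        adt_df[names] = rolled

This cuts the number of rolling calls from one per metric down to one per aggregation, at the cost of a little index bookkeeping.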

Of course, this pseudo-logic of counting the number of calls is only a heuristic which may guide you in the direction of faster code. The proof is in the pudding:

import collections
import timeit 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def make_df(nrows=100, ncols=3):
    seed = 2018
    np.random.seed(seed)
    df = pd.DataFrame(np.random.randint(10, size=(nrows, ncols)))
    df['athlete_id'] = np.random.randint(10, size=nrows)
    return df

def orig(df, key='athlete_id'):
    columns = list(df.columns.difference([key]))
    result = pd.DataFrame(index=df.index)
    for window in range(2, 4):
        for col in columns:
            colname = 'sum_col{}_winsize{}'.format(col, window)
            result[colname] = df.groupby(key)[col].apply(lambda x: x.rolling(
                center=False, window=window, min_periods=1).sum())
            colname = 'min_col{}_winsize{}'.format(col, window)
            result[colname] = df.groupby(key)[col].apply(lambda x: x.rolling(
                center=False, window=window, min_periods=1).min())
            colname = 'max_col{}_winsize{}'.format(col, window)
            result[colname] = df.groupby(key)[col].apply(lambda x: x.rolling(
                center=False, window=window, min_periods=1).max())
    result = pd.concat([df, result], axis=1)
    return result

def alt(df, key='athlete_id'):
    """
    Call rolling on the whole DataFrame, not each column separately
    """
    columns = list(df.columns.difference([key]))
    result = [df]
    for window in range(2, 4):
        rolled = df.groupby(key, group_keys=False).rolling(
            center=False, window=window, min_periods=1)

        # NB: where groupby/rolling puts the group key varies across
        # pandas versions -- it may appear as a column (dropped here)
        # and/or as an extra index level; if the drop or the final
        # concat misbehaves on your version, drop the index level with
        # .reset_index(level=0, drop=True) instead
        new_df = rolled.sum().drop(key, axis=1)
        new_df.columns = ['sum_col{}_winsize{}'.format(col, window) for col in columns]
        result.append(new_df)

        new_df = rolled.min().drop(key, axis=1)
        new_df.columns = ['min_col{}_winsize{}'.format(col, window) for col in columns]
        result.append(new_df)

        new_df = rolled.max().drop(key, axis=1)
        new_df.columns = ['max_col{}_winsize{}'.format(col, window) for col in columns]
        result.append(new_df)

    # concatenate once at the end: inserting thousands of columns one
    # at a time into a large DataFrame is itself a pandas slowdown
    df = pd.concat(result, axis=1)
    return df

timing = collections.defaultdict(list)
ncols = [3, 10, 20, 50, 100]
for n in ncols:
    df = make_df(ncols=n)
    timing['orig'].append(timeit.timeit(
        'orig(df)',
        'from __main__ import orig, alt, df',
        number=10))
    timing['alt'].append(timeit.timeit(
        'alt(df)',
        'from __main__ import orig, alt, df',
        number=10))

plt.plot(ncols, timing['orig'], label='using groupby/apply (orig)')
plt.plot(ncols, timing['alt'], label='using groupby/rolling (alternative)')
plt.legend(loc='best')
plt.xlabel('number of columns')
plt.ylabel('seconds')
print(pd.DataFrame(timing, index=pd.Series(ncols, name='ncols')))
plt.show()

Running the script produces this plot:

[plot: runtime in seconds vs. number of columns, orig vs. alt]

and yields these timeit benchmarks (seconds per 10 runs):

            alt       orig
ncols                     
3      0.871695   0.996862
10     0.991617   3.307021
20     1.168522   6.602289
50     1.676441  16.558673
100    2.521121  33.261957

The speed advantage of alt over orig grows with the number of columns: at 100 columns, alt is roughly 13x faster.
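If your pandas version supports it, you may be able to squeeze a bit more out of the same idea: Rolling.agg accepts a list of function names, so the sum, min and max for a window can come from a single call on one rolling object. A sketch (alt2 is a hypothetical name, and the same caveats about group-key/index handling as in alt apply):

def alt2(df, key='athlete_id'):
    """Like alt, but computes sum/min/max with one agg call per window."""
    columns = list(df.columns.difference([key]))
    result = [df]
    for window in range(2, 4):
        rolled = df.groupby(key, group_keys=False)[columns].rolling(
            center=False, window=window, min_periods=1)
        # one call returns a frame with (column, function) MultiIndex columns
        new_df = rolled.agg(['sum', 'min', 'max'])
        new_df.columns = ['{}_col{}_winsize{}'.format(func, col, window)
                          for col, func in new_df.columns]
        result.append(new_df)
    return pd.concat(result, axis=1)

Whether this beats alt in practice is worth checking with the same timeit harness as above.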
