Apply function to rolling window on multi-indexed DataFrame, get only one aggregated result per date and category

I have a DataFrame with a MultiIndex of dates and a category, which looks like this:

                  D
Date       G            
2004-01-01 A      1
           A      2
           A      3
           A      4
           B    101
           B    102      
           B    103 
           C    201 
           C    202 
           C    203 
           C    204 
           C    205 
2004-01-02 A      5 
           A      6
           A      7
           A      8
           B    104 
           B    105 
           B    106 
           C    206 
           C    207 
           C    208 
           C    209 
           C    210 
...

Now I want to apply a rolling window to aggregate my data, but with the aggregation grouped by the second index level. With a function like min() and a time window of three days (no minimum number of periods required), the result would look like this (see the small sketch after the example output for the window semantics I mean):

                  D
Date       G            
2004-01-01 A      1
           B    101
           C    201 
2004-01-02 A      1 
           B    101 
           C    201 
2004-01-03 A      1
           B    101
           C    201
2004-01-04 A      5 
           B    104 
           C    206
2004-01-05 A      9
           B    107
           C    211 
...
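For reference, the window semantics I have in mind match pandas' time-based rolling: a trailing window ending at each row's date, where min_periods defaults to 1, so the first rows simply see a shorter window. A toy sketch:

import pandas as pd

s = pd.Series([1, 5, 9, 13],
              index=pd.to_datetime(['2004-01-01', '2004-01-02',
                                    '2004-01-03', '2004-01-04']))

# trailing 3-day window: each value is the min over the current
# date and the two days before it
s.rolling('3d').min()
# 2004-01-01    1.0   (window: 01-01)
# 2004-01-02    1.0   (window: 01-01 to 01-02)
# 2004-01-03    1.0   (window: 01-01 to 01-03)
# 2004-01-04    5.0   (window: 01-02 to 01-04)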

I have already tried resampling, but I couldn't get overlapping time windows with it.
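To illustrate that problem, here is a sketch of the resampling approach (not my exact code): resample partitions the dates into disjoint 3-day bins, so each group gets one row per bin instead of one row per day.

# resample produces non-overlapping 3-day bins, not a window
# that slides forward one day at a time
binned = (df.reset_index(level='G')
            .groupby('G')
            .resample('3d')['D']
            .min())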

This is what I already have:

import pandas as pd

# result DataFrame, one row per row of df (the rolling result is per-row)
df_agg = pd.DataFrame(index=df.index, columns=['D'])

# get all unique categories in df
groups = df.reset_index(level='G')['G'].unique()

# iterate over categories in G
for group in groups:
    df_group = df.loc[df.index.get_level_values('G') == group]

    df_group_agg = pd.DataFrame()
    df_group_agg['D'] = df_group.reset_index(level='G')\
        .groupby('G')['D'].rolling('3d').min()

    # groupby moved G to the front; restore the (Date, G) order
    df_group_agg = df_group_agg.swaplevel(0, 1)

    # write the data into the correct place (positionally, since the
    # duplicated (Date, G) labels prevent index alignment)
    df_agg.loc[df.index.get_level_values('G') == group, 'D'] = \
        df_group_agg['D'].values

But this leaves me with something like this:

                  D
Date       G            
2004-01-01 A      1
           A      1
           A      1
           A      1
           B    101
           B    101      
           B    101 
           C    201 
           C    201 
           C    201 
           C    201 
           C    201 
2004-01-02 A      1 
           A      1
           A      1
           A      1
           B    101 
           B    101 
           B    101 
           C    201 
           C    201 
           C    201 
           C    201 
           C    201
...

I could try to remove all the unnecessary rows afterwards, but this seems inefficient to me, as the aggregation function has to be evaluated far more often than needed.
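Dropping the duplicates could look like the sketch below (assuming the duplicated result from above is stored in df_agg): since the window only grows within a (Date, G) block, the last row of each block carries the fully aggregated value. The wasted work of computing all the intermediate windows remains, though.

# keep one row per (Date, G); the last row of each block has
# seen the complete trailing window
df_agg = df_agg.groupby(level=['Date', 'G']).last()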

Is there a more elegant way to achieve my goal? Do I have to iterate manually?

Edit:

Here is some code to generate the described example:

from datetime import datetime
import numpy as np
import pandas as pd

dates = pd.date_range(start=datetime(2004, 1, 1),
                      end=datetime(2004, 3, 1),
                      freq='D')
df = pd.DataFrame(data={'Date': np.repeat(dates, 4+3+5),
                        'G': (['A']*4 + ['B']*3 + ['C']*5) * len(dates)})
df['D'] = 0
df.loc[df['G']=='A','D'] = range(1,len(df.loc[df['G']=='A','D'])+1)
df.loc[df['G']=='B','D'] = range(101,len(df.loc[df['G']=='B','D'])+101)
df.loc[df['G']=='C','D'] = range(201,len(df.loc[df['G']=='C','D'])+201)
df = df.set_index(['Date','G'])
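A quick look at the first day of the generated frame should match the example at the top:

# first 12 rows = 2004-01-01: four A-rows, three B-rows, five C-rows
print(df.head(12))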

I managed to find a working solution that looks much better than the one before. Nevertheless, I'm very interested in alternatives that would use more of the existing pandas methods instead of these two for loops (a pandas-only sketch is at the very end of this post).

# window width in days
window_width = 3

# result DataFrame with pre-initialised index
df_agg = pd.DataFrame(index=df.index.unique(), columns=['D'])

# get all unique categories and dates in df
groups = df.reset_index(level='G')['G'].unique()
dates = df.reset_index(level='Date')['Date'].unique()
# pad with repeated edge dates so the first windows have
# window_width entries
dates = np.pad(dates, window_width-1, mode='edge')

for group in groups:
    # one window per original date; the padding added
    # window_width-1 artificial entries at each end
    for day in range(len(dates) - 2*(window_width-1)):
        rolling_window = dates[day:day+window_width]
        rolling_data = df.loc[(rolling_window, group), 'D']
        df_agg.loc[(rolling_window[-1], group), 'D'] = np.min(rolling_data)

A problem with this solution is that window_width counts entries of the date index rather than calendar days, so its meaning depends on the actual time intervals present in the data. Here is an alternative that really takes a window in days. The first few rows won't get aggregated this way, but that is no problem for me. Unfortunately, it seems to be very slow.

# window width in days
window_width = 3

# result DataFrame with pre-initialised index
df_agg = pd.DataFrame(index=df.index.unique(), columns=['D'])
df_agg.sort_index(inplace=True)

# get all unique categories and dates in df
groups = df.reset_index(level='G')['G'].unique()
dates = df.reset_index(level='Date')['Date'].unique()  # already sorted

for group in groups:
    for day in range(len(dates)):
        window_start = dates[day]
        window_end = window_start + np.timedelta64(window_width, 'D')

        # window includes start, but excludes end; dates is sorted,
        # so the window list is already in order
        window = [d for d in dates if window_start <= d < window_end]

        # stop before the shrinking windows at the end of the range;
        # they would overwrite earlier results with aggregates over
        # fewer than window_width days
        if window_end > dates[-1] + np.timedelta64(1, 'D'):
            break

        # select only this group's rows inside the window; a label
        # slice (window[0], group):(window[-1], group) would also
        # include the other groups sorting between the endpoints
        rolling_data = df.loc[(window, group), 'D']
        df_agg.loc[(window[-1], group), 'D'] = np.min(rolling_data)
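In the meantime, here is the direction I imagine a pandas-only solution could take (a sketch I haven't fully verified): since the minimum of per-day minima equals the minimum over all rows, the data can first be collapsed to one row per (Date, G) and then rolled with a real 3-day window within each group.

# collapse to one row per (Date, G); valid because min can be
# computed from per-day partial minima
daily = df.groupby(level=['Date', 'G'])['D'].min()

# roll a 3-day time window within each group
df_agg = (daily.reset_index(level='G')
               .groupby('G')['D']
               .rolling('3d')
               .min()
               .swaplevel(0, 1)   # groupby puts G first; restore (Date, G)
               .sort_index())

This should reproduce the expected table above, including the shorter windows on the first two dates. Note that the pre-aggregation step is only valid for functions such as min, max or sum that can be computed from per-day partial results.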
