
pandas groupby on rolling window

Sample data:

import random
import string
import pandas as pd

test1 = pd.DataFrame({
    'subID':[''.join(random.choice(string.ascii_letters[0:4]) for _ in range(3)) for n in range(100)],
    'ID':[''.join(random.choice(string.ascii_letters[5:9]) for _ in range(3)) for n in range(100)],
    'date':[pd.to_datetime(random.choice(['01-01-2018','02-01-2018','03-01-2018',
                                          '04-01-2018','05-01-2018','06-01-2018',
                                          '07-01-2018','08-01-2018','09-01-2018'])) for n in range(100)],
    'val':[random.choice([1,2,3,4]) for n in range(100)]
}).sort_values('date').drop_duplicates(subset=['subID','date'])

idxs = pd.period_range(min(test1.date), max(test1.date), freq='M')

test1['date'] = pd.to_datetime(test1.date, format='%m-%d-%Y').dt.to_period("M")

frames = []
for name, group in test1.groupby('subID'):
    # give every subID one row per month, then fill the gaps
    g_ = group.set_index('date').reindex(idxs).reset_index().rename(columns={'index': 'date'})
    g_['subID'] = g_.subID.bfill().ffill()
    g_['ID'] = g_.ID.bfill().ffill()
    g_['val'] = g_.val.fillna(0)
    frames.append(g_)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames, ignore_index=True)

Now, on df, I want to run a calculation (like np.std) on every rolling 3-month window within each ID. So:

for name, group in df.groupby('ID'):
    ...

Then, within each ID group, I want the standard deviation of ALL the values that fall in a rolling 3-month window. If an ID group contains 3 subID groups, each subID has its own set of dates and vals. How can I compute the standard deviation over the values of every subID in that 3-month window, save it, and repeat for each subsequent 3-month window?

If the data look like:

        date subID   ID  val
389  2018-03   dca  fff  0.0
407  2018-03   dcc  fff  0.0
390  2018-04   dca  fff  1.0
408  2018-04   dcc  fff  0.0
391  2018-05   dca  fff  3.0
409  2018-05   dcc  fff  0.0
392  2018-06   dca  fff  0.0
410  2018-06   dcc  fff  2.0
393  2018-07   dca  fff  0.0
411  2018-07   dcc  fff  0.0
394  2018-08   dca  fff  3.0
412  2018-08   dcc  fff  0.0
413  2018-09   dcc  fff  4.0

Then the windows would be:

[2018-03, 2018-04, 2018-05] and the calculation would be: np.std([0,0,1,0,3,0])

[2018-04, 2018-05, 2018-06] and the calculation would be: np.std([1,0,3,0,0,2])

[2018-05, 2018-06, 2018-07] and the calculation would be: np.std([3,0,0,2,0,0])

and so on...

So ultimately the final dataset would hold one standard deviation per month for each ID (except for the first two months, which do not have a full window).
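To make the expected output concrete, the windows described above can be checked directly on the fff sample. This is a minimal sketch, not a general solution: the names sample and window_std are illustrative, and it assumes the population standard deviation (ddof=0), which is what np.std computes by default, whereas pandas .std() defaults to ddof=1.

```python
import numpy as np
import pandas as pd

# The 'fff' rows from the sample table, with dates as monthly periods.
sample = pd.DataFrame({
    'date': pd.PeriodIndex(
        ['2018-03', '2018-03', '2018-04', '2018-04', '2018-05', '2018-05',
         '2018-06', '2018-06', '2018-07', '2018-07', '2018-08', '2018-08',
         '2018-09'], freq='M'),
    'subID': ['dca', 'dcc'] * 6 + ['dcc'],
    'ID': 'fff',
    'val': [0, 0, 1, 0, 3, 0, 0, 2, 0, 0, 3, 0, 4.0],
})

months = sample['date'].drop_duplicates().sort_values().tolist()

window_std = {}
for i in range(2, len(months)):
    first, last = months[i - 2], months[i]
    # pool the values of every subID that fall inside the 3-month window
    in_window = (sample['date'] >= first) & (sample['date'] <= last)
    pooled = sample.loc[in_window, 'val'].to_numpy()
    # key each result by the last month of its window
    window_std[str(last)] = float(np.std(pooled))

print(window_std)
```

Each entry is keyed by the last month of its window, so the first two months have no entry.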

Try this snippet:

import numpy as np

df['month'] = df.date.dt.month # adding month column for simplicity

mdf = pd.DataFrame({'month':[1,2,3,4,5,6,7,8,9]}) # for zero filling

# zero-fill the missing months for each ID
df = df.groupby('ID').apply(
    lambda x: x[['ID', 'month', 'val']]
    .merge(mdf, on='month', how='right')
    .fillna({'ID': x.ID.dropna().unique()[0], 'val': 0})
).reset_index(drop=True)

# collect all values for each (ID, month) combination into one array
df1 = df.groupby(['ID', 'month']).apply(lambda x: x.val.values).reset_index().rename({0: 'val'}, axis=1)

def customrolling(x):
    '''For one group (i.e. one ID), return a dataframe whose 'stdval' column is the std of all values in the trailing 3-month window.'''
    stdval = []
    for i in range(len(x)):
        if i >= 2:
            # pool the value arrays of the current month and the two months before it
            stdval.append(np.std(np.concatenate(x['val'].iloc[i-2:i+1].values)))
        else:
            # fewer than 3 months available: 0 is used as a placeholder
            stdval.append(0)
    return pd.DataFrame({'ID': x.ID.values, 'month': x.month.values, 'stdval': stdval})

target_df = df1.groupby('ID').apply(customrolling).reset_index(drop=True)

This gives the desired target_df:

    ID  month  stdval
0  fff      1     0.0
1  fff      2     0.0
2  fff      3     0.0
3  fff      4     0.0
4  fff      5     0.0


 