Sample data:
import random
import string
import pandas as pd

test1 = pd.DataFrame({
    'subID': [''.join(random.choice(string.ascii_letters[0:4]) for _ in range(3)) for n in range(100)],
    'ID': [''.join(random.choice(string.ascii_letters[5:9]) for _ in range(3)) for n in range(100)],
    'date': [pd.to_datetime(random.choice(['01-01-2018', '02-01-2018', '03-01-2018',
                                           '04-01-2018', '05-01-2018', '06-01-2018',
                                           '07-01-2018', '08-01-2018', '09-01-2018'])) for n in range(100)],
    'val': [random.choice([1, 2, 3, 4]) for n in range(100)]
}).sort_values('date').drop_duplicates(subset=['subID', 'date'])

# Full monthly range covered by the data, then convert dates to monthly periods
idxs = pd.period_range(min(test1.date), max(test1.date), freq='M')
test1['date'] = pd.to_datetime(test1.date, format='%m-%d-%Y').dt.to_period("M")
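
# Reindex every subID onto the full month range; months without data get val = 0
groups = []
for name, group in test1.groupby('subID'):
    g_ = group.set_index('date').reindex(idxs).reset_index().rename(columns={'index': 'date'})
    g_['subID'] = g_.subID.bfill().ffill()
    g_['ID'] = g_.ID.bfill().ffill()
    g_['val'] = g_.val.fillna(0)
    groups.append(g_)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(groups, ignore_index=True)
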
Now, on df, I want to run a calculation (like np.std) on every rolling 3-month window within each ID. So:

for name, group in df.groupby('ID'):
    ...

Then, on each group, I want the standard deviation of ALL the values across a rolling 3-month window. So if an ID group contains 3 subID groups, each with its own set of dates and vals, how can I get the standard deviation of all the values from every subID in that 3-month window, save it, and keep calculating for every subsequent 3-month window?
If the data look like:

        date subID   ID  val
389  2018-03   dca  fff  0.0
407  2018-03   dcc  fff  0.0
390  2018-04   dca  fff  1.0
408  2018-04   dcc  fff  0.0
391  2018-05   dca  fff  3.0
409  2018-05   dcc  fff  0.0
392  2018-06   dca  fff  0.0
410  2018-06   dcc  fff  2.0
393  2018-07   dca  fff  0.0
411  2018-07   dcc  fff  0.0
394  2018-08   dca  fff  3.0
412  2018-08   dcc  fff  0.0
413  2018-09   dcc  fff  4.0

Then the windows would be:
[2018-03, 2018-04, 2018-05], and the calculation would be: np.std([0, 0, 1, 0, 3, 0])
[2018-04, 2018-05, 2018-06], and the calculation would be: np.std([1, 0, 3, 0, 0, 2])
[2018-05, 2018-06, 2018-07], and the calculation would be: np.std([3, 0, 0, 2, 0, 0])
and so on...
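
As a quick hand check of those windows, here is a minimal sketch that pools the values of ID fff month by month and applies np.std to each 3-month window. The names vals_by_month and window_std are made up for this illustration; the numbers are copied from the table above. Note that np.std computes the population standard deviation (ddof=0), whereas pandas' Series.std defaults to ddof=1, so the two will not agree unless ddof is set explicitly.

```python
import numpy as np

# Values per month for ID 'fff', copied from the example table
# (the rows of both subIDs are pooled together).
vals_by_month = {
    '2018-03': [0.0, 0.0],
    '2018-04': [1.0, 0.0],
    '2018-05': [3.0, 0.0],
    '2018-06': [0.0, 2.0],
    '2018-07': [0.0, 0.0],
}

months = sorted(vals_by_month)
window_std = {}
for i in range(2, len(months)):
    window = months[i - 2:i + 1]  # current month plus the two before it
    pooled = np.concatenate([vals_by_month[m] for m in window])
    window_std[months[i]] = np.std(pooled)  # population std (ddof=0)

for month, value in window_std.items():
    print(month, round(float(value), 4))
```

Each result is keyed by the last month of its window, which is why the first two months have no entry.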
So ultimately the final dataset would contain one standard deviation per month for each ID (except for the first two months, due to the window size).

Try this snippet:

import numpy as np

df['month'] = df.date.dt.month  # adding month column for simplicity
mdf = pd.DataFrame({'month': [1, 2, 3, 4, 5, 6, 7, 8, 9]})  # for zero filling

# Zero filling for each ID
df = df.groupby('ID').apply(
    lambda x: x[['ID', 'month', 'val']].merge(mdf, on='month', how='right').fillna(
        {'ID': x.ID.dropna().unique()[0], 'val': 0})).reset_index(drop=True)

# Aggregate the values of each ID and month combination into one array for further computation
df1 = df.groupby(['ID', 'month']).apply(lambda x: x.val.values).reset_index().rename({0: 'val'}, axis=1)

def customrolling(x):
    '''Iterate over one group (i.e. one ID) and return a dataframe whose
    column 'stdval' is the rolling std of the last 3 months for that ID.'''
    stdval = []
    temp = pd.DataFrame(columns=['ID', 'month', 'stdval'])
    for i, m in enumerate(x.iterrows()):
        if i >= 2:
            # std over the pooled values of the current month and the two before it
            stdval.append(np.std(np.concatenate(x.iloc[i-2:i+1, :]['val'].values, axis=0)))
        else:
            # first two months have no full window
            stdval.append(0)
    temp.ID = x.ID
    temp.month = x.month
    temp.stdval = stdval
    return temp

target_df = df1.groupby('ID').apply(lambda x: customrolling(x)).reset_index(drop=True)

This will give the desired target_df:

    ID  month  stdval
0  fff      1     0.0
1  fff      2     0.0
2  fff      3     0.0
3  fff      4     0.0
4  fff      5     0.0
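
If the explicit loop becomes slow on larger data, one possible alternative (not part of the snippet above, just a sketch) is to reduce each ID and month to a count, a sum and a sum of squares, take rolling sums of those three numbers, and recover the pooled standard deviation from them. The function name pooled_rolling_std and the helper columns n, s and ss are hypothetical. The sketch assumes a frame with columns ID, date and val in which every ID has one or more rows for every month, because rolling(3) counts rows rather than calendar months. Unlike the snippet above, it leaves the first two months as NaN instead of 0.

```python
import numpy as np
import pandas as pd

def pooled_rolling_std(frame, window=3):
    """Population std (ddof=0, like np.std) of all 'val' entries pooled
    over `window` consecutive months, computed separately for each ID."""
    # Per (ID, month): number of values, their sum, and their sum of squares.
    stats = (
        frame.assign(sq=frame['val'] ** 2)
             .groupby(['ID', 'date'], as_index=False)
             .agg(n=('val', 'count'), s=('val', 'sum'), ss=('sq', 'sum'))
             .sort_values(['ID', 'date'], ignore_index=True)
    )
    # Rolling sums within each ID; dropping the ID level restores the row index.
    rolled = (
        stats.groupby('ID')[['n', 's', 'ss']]
             .rolling(window)
             .sum()
             .reset_index(level=0, drop=True)
    )
    mean = rolled['s'] / rolled['n']
    var = rolled['ss'] / rolled['n'] - mean ** 2
    # Clip tiny negative values caused by floating-point rounding.
    stats['stdval'] = np.sqrt(var.clip(lower=0))
    return stats[['ID', 'date', 'stdval']]

# Usage on a few rows shaped like the question's example (dates as strings here).
sample = pd.DataFrame({
    'date': ['2018-03', '2018-03', '2018-04', '2018-04',
             '2018-05', '2018-05', '2018-06', '2018-06'],
    'subID': ['dca', 'dcc'] * 4,
    'ID': 'fff',
    'val': [0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0],
})
result = pooled_rolling_std(sample)
print(result)
```

The sum-of-squares formula can lose precision when the values are large relative to their spread, so for such data the explicit concatenation used in customrolling is the safer choice.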