Given a DataFrame of timestamped data, I would like to compute the median of a certain variable over the past 4-6 days. The median over the past 1-3 days can be computed with pandas.DataFrame.rolling, but I couldn't find how to use rolling to compute the median over the past 4-6 days.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
df['timestamp'] = pd.to_datetime(df['timestamp'])  # keep datetime64 dtype; astype(pd.Timestamp) may be rejected by newer pandas
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and a single day may contain more data points.
timestamp var
0 2011-01-01 00:00:00 1.624345
1 2011-01-01 06:00:00 -0.611756
2 2011-01-01 12:00:00 -0.528172
3 2011-01-01 18:00:00 -1.072969
4 2011-01-02 00:00:00 0.865408
5 2011-01-02 06:00:00 -2.301539
6 2011-01-02 12:00:00 1.744812
7 2011-01-02 18:00:00 -0.761207
8 2011-01-03 00:00:00 0.319039
9 2011-01-03 06:00:00 -0.249370
10 2011-01-03 12:00:00 1.462108
Desired output:
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN # no data in past 4-6 days
1 2011-01-01 06:00:00 -0.611756 NaN # no data in past 4-6 days
2 2011-01-01 12:00:00 -0.528172 NaN # no data in past 4-6 days
3 2011-01-01 18:00:00 -1.072969 NaN # no data in past 4-6 days
4 2011-01-02 00:00:00 0.865408 NaN # no data in past 4-6 days
5 2011-01-02 06:00:00 -2.301539 NaN # no data in past 4-6 days
6 2011-01-02 12:00:00 1.744812 NaN # no data in past 4-6 days
7 2011-01-02 18:00:00 -0.761207 NaN # no data in past 4-6 days
8 2011-01-03 00:00:00 0.319039 NaN # no data in past 4-6 days
9 2011-01-03 06:00:00 -0.249370 NaN # no data in past 4-6 days
10 2011-01-03 12:00:00 1.462108 NaN # no data in past 4-6 days
11 2011-01-03 18:00:00 -2.060141 NaN # no data in past 4-6 days
12 2011-01-04 00:00:00 -0.322417 NaN # no data in past 4-6 days
13 2011-01-04 06:00:00 -0.384054 NaN # no data in past 4-6 days
14 2011-01-04 12:00:00 1.133769 NaN # no data in past 4-6 days
15 2011-01-04 18:00:00 -1.099891 NaN # no data in past 4-6 days
16 2011-01-05 00:00:00 -0.172428 NaN # only 4 data points in past 4-6 days
17 2011-01-05 06:00:00 -0.877858 -0.528172
18 2011-01-05 12:00:00 0.042214 -0.569964
19 2011-01-05 18:00:00 0.582815 -0.528172
20 2011-01-06 00:00:00 -1.100619 -0.569964
21 2011-01-06 06:00:00 1.144724 -0.528172
22 2011-01-06 12:00:00 0.901591 -0.388771
23 2011-01-06 18:00:00 0.502494 -0.249370
My current code:
def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    for i in range(len(df)):  # was xrange (Python 2)
        # Offset of every row relative to row i
        delta = df['timestamp'] - df['timestamp'].loc[i]
        pastVar2 = df[var].loc[(delta < datetime.timedelta(days=-window)) &
                               (delta >= datetime.timedelta(days=-window*2))]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window+1, window*2, var)] = np.median(pastVar2.values)
    return df
Current speed:
In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop
I edited the post to clearly show my expected output, which requires at least 5 data points in the window. I've set the random seed so that everyone gets the same input and can show the same output. As far as I know, a simple rolling and shift does not work when there are multiple data points in the same day.
Here we go:
df.set_index('timestamp', inplace = True)
df['var'] =df['var'].rolling('3D', min_periods = 3).median().shift(freq = pd.Timedelta('4d')).shift(-1)
df['var']
Out[55]:
timestamp
2011-01-01 00:00:00 NaN
2011-01-01 06:00:00 NaN
2011-01-01 12:00:00 NaN
2011-01-01 18:00:00 NaN
2011-01-02 00:00:00 NaN
2011-01-02 06:00:00 NaN
2011-01-02 12:00:00 NaN
2011-01-02 18:00:00 NaN
2011-01-03 00:00:00 NaN
2011-01-03 06:00:00 NaN
2011-01-03 12:00:00 NaN
2011-01-03 18:00:00 NaN
2011-01-04 00:00:00 NaN
2011-01-04 06:00:00 NaN
2011-01-04 12:00:00 NaN
2011-01-04 18:00:00 NaN
2011-01-05 00:00:00 NaN
2011-01-05 06:00:00 -0.528172
2011-01-05 12:00:00 -0.569964
2011-01-05 18:00:00 -0.528172
2011-01-06 00:00:00 -0.569964
2011-01-06 06:00:00 -0.528172
2011-01-06 12:00:00 -0.569964
2011-01-06 18:00:00 -0.528172
2011-01-07 00:00:00 -0.388771
2011-01-07 06:00:00 -0.249370
2011-01-07 12:00:00 -0.388771
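A variation on the rolling-plus-shift idea, offered as a sketch rather than a drop-in fix: a time-based window with closed='left' evaluated at T - 3 days covers [T - 6 days, T - 3 days), which is the interval the question's loop selects. The function name shifted_rolling_median and its parameters are hypothetical. The approach assumes unique, sorted timestamps and that T - 3 days is itself present in the index for every T (true for the regular 6-hour grid in the example, not for data with gaps).

```python
import numpy as np
import pandas as pd

def shifted_rolling_median(series, lag_days=3, width_days=3, min_points=5):
    """Median of the values in [T - lag - width, T - lag) for each index T.

    Hypothetical helper; assumes a sorted, unique DatetimeIndex in which
    T - lag is also present for every T of interest.
    """
    lag = pd.Timedelta(days=lag_days)
    width = pd.Timedelta(days=width_days)
    # closed='left' makes the window at time s cover [s - width, s).
    rolled = series.rolling(width, closed='left', min_periods=min_points).median()
    # Relabel each result from s to s + lag, then align with the original index.
    return rolled.shift(freq=lag).reindex(series.index)

# Small demonstration on a regular 6-hour grid (illustrative data only).
idx = pd.date_range('2011-01-01', periods=24, freq=pd.Timedelta(hours=6))
demo = pd.Series(np.arange(24, dtype=float), index=idx)
result = shifted_rolling_median(demo)
```

Rows whose window holds fewer than min_points observations come back as NaN, as do rows for which T - 3 days is not in the index.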
As the problem is set up, each row of an irregular time series has a window that spans a different number of rows, which calls for an iterative approach like the one you have started. But if we make the timestamps the index:
# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))
In this case I chose an interval of 12 hours, but it could be whatever is available, including irregular spacing. Using a modified function with a window for the median, along with an offset (here, a positive Delta looks backwards), gives you the flexibility you wanted:
def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        # Rows strictly between Ti - Delta - window and Ti - Delta
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df
This runs substantially faster:
%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
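The loop above rescans the whole index for every row. If the timestamps are sorted, the window bounds can instead be located by binary search, which also copes with gaps and with several rows per day. The following is a minimal sketch under that assumption. The name lagged_window_median and its parameters are hypothetical, and the defaults (3-day lag, 3-day width, at least 5 points, window [T - 6 days, T - 3 days)) follow the question's loop rather than the function above.

```python
import numpy as np
import pandas as pd

def lagged_window_median(timestamps, values, lag_days=3, width_days=3, min_points=5):
    """For each timestamp T, median of the values with time in
    [T - lag - width, T - lag); NaN when fewer than min_points fall inside.

    Hypothetical helper; assumes timestamps are sorted in ascending order.
    """
    ts = np.asarray(timestamps, dtype='datetime64[ns]')
    vals = np.asarray(values, dtype=float)
    # First position with time >= T - lag - width (inclusive lower bound).
    lo = np.searchsorted(ts, ts - np.timedelta64(lag_days + width_days, 'D'), side='left')
    # First position with time >= T - lag (exclusive upper bound of the slice).
    hi = np.searchsorted(ts, ts - np.timedelta64(lag_days, 'D'), side='left')
    out = np.full(len(ts), np.nan)
    for i, (a, b) in enumerate(zip(lo, hi)):
        if b - a >= min_points:
            out[i] = np.median(vals[a:b])
    return out

# Small demonstration on a regular 6-hour grid (illustrative data only).
stamps = pd.date_range('2011-01-01', periods=24, freq=pd.Timedelta(hours=6))
medians = lagged_window_median(stamps, np.arange(24, dtype=float))
```

A Python loop remains, but each iteration only slices an already-located range instead of building two boolean masks over the full index.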
min_periods should be 2 instead of 5, because the window size should not be counted in (5 - 3 = 2).
import pandas as pd
import numpy as np
import datetime
np.random.seed(1) # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = pd.to_datetime(df['timestamp'])  # keep datetime64 dtype; astype(pd.Timestamp) may be rejected by newer pandas
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df
%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop
My first try didn't use shift(), but then I saw Noobie's answer. I made the following version with shift(), which is much faster than the previous one.
def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df
%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop
The second one is around 4 times as fast as the first one.
These two functions produce the same result, which looks like this:
df2 = test()
df2
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN
1 2011-01-02 00:00:00 -0.611756 NaN
2 2011-01-03 00:00:00 -0.528172 NaN
3 2011-01-04 00:00:00 -1.072969 NaN
4 2011-01-05 00:00:00 0.865408 0.506294
5 2011-01-06 00:00:00 -2.301539 -0.528172
6 2011-01-07 00:00:00 1.744812 -0.611756
... ... ... ...
93 2011-04-04 00:00:00 -0.638730 1.129484
94 2011-04-05 00:00:00 0.423494 1.129484
95 2011-04-06 00:00:00 0.077340 0.185156
96 2011-04-07 00:00:00 -0.343854 -0.375285
97 2011-04-08 00:00:00 0.043597 -0.375285
98 2011-04-09 00:00:00 -0.620001 0.077340
99 2011-04-10 00:00:00 0.698032 0.077340