
Optimize code to find the median of values of past 4 to 6 days for each row in a DataFrame

Given a DataFrame of timestamped data, I would like to compute the median of a variable over the past 4-6 days. The median over the past 1-3 days can be computed with pandas.DataFrame.rolling, but I couldn't find how to use rolling to compute the median over the past 4-6 days.

import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
# date_range already yields datetime64 values, so no extra cast is needed
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

The data looks like this. My real data has gaps in time and may have more data points per day.

              timestamp       var
0   2011-01-01 00:00:00  1.624345
1   2011-01-01 06:00:00 -0.611756
2   2011-01-01 12:00:00 -0.528172
3   2011-01-01 18:00:00 -1.072969
4   2011-01-02 00:00:00  0.865408
5   2011-01-02 06:00:00 -2.301539
6   2011-01-02 12:00:00  1.744812
7   2011-01-02 18:00:00 -0.761207
8   2011-01-03 00:00:00  0.319039
9   2011-01-03 06:00:00 -0.249370
10  2011-01-03 12:00:00  1.462108

Desired output:

              timestamp       var  past4d-6d_var_median
0   2011-01-01 00:00:00  1.624345                   NaN # no data in past 4-6 days
1   2011-01-01 06:00:00 -0.611756                   NaN # no data in past 4-6 days
2   2011-01-01 12:00:00 -0.528172                   NaN # no data in past 4-6 days
3   2011-01-01 18:00:00 -1.072969                   NaN # no data in past 4-6 days
4   2011-01-02 00:00:00  0.865408                   NaN # no data in past 4-6 days
5   2011-01-02 06:00:00 -2.301539                   NaN # no data in past 4-6 days
6   2011-01-02 12:00:00  1.744812                   NaN # no data in past 4-6 days
7   2011-01-02 18:00:00 -0.761207                   NaN # no data in past 4-6 days
8   2011-01-03 00:00:00  0.319039                   NaN # no data in past 4-6 days
9   2011-01-03 06:00:00 -0.249370                   NaN # no data in past 4-6 days
10  2011-01-03 12:00:00  1.462108                   NaN # no data in past 4-6 days
11  2011-01-03 18:00:00 -2.060141                   NaN # no data in past 4-6 days
12  2011-01-04 00:00:00 -0.322417                   NaN # no data in past 4-6 days
13  2011-01-04 06:00:00 -0.384054                   NaN # no data in past 4-6 days
14  2011-01-04 12:00:00  1.133769                   NaN # no data in past 4-6 days
15  2011-01-04 18:00:00 -1.099891                   NaN # no data in past 4-6 days
16  2011-01-05 00:00:00 -0.172428                   NaN # only 4 data in past 4-6 days
17  2011-01-05 06:00:00 -0.877858             -0.528172
18  2011-01-05 12:00:00  0.042214             -0.569964
19  2011-01-05 18:00:00  0.582815             -0.528172
20  2011-01-06 00:00:00 -1.100619             -0.569964
21  2011-01-06 06:00:00  1.144724             -0.528172
22  2011-01-06 12:00:00  0.901591             -0.388771
23  2011-01-06 18:00:00  0.502494             -0.249370

My current code:

def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days; each row uses the rows that lie more than
    # `window` days and at most 2*`window` days before it
    for i in range(len(df)):  # range replaces Python 2's xrange
        delta = df['timestamp'] - df['timestamp'].loc[i]
        in_window = (delta < datetime.timedelta(days=-window)) & (delta >= datetime.timedelta(days=-window * 2))
        pastVar2 = df[var].loc[in_window]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window + 1, window * 2, var)] = np.median(pastVar2.values)
    return df

Current speed:

In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop

I edited the post to show my expected output clearly, including the requirement of at least 5 data points. I set the random seed so that everyone gets the same input and can show the same output. As far as I know, a simple rolling and shift does not work when there are multiple data points in the same day.

Here we go:

df.set_index('timestamp', inplace = True)
df['var'] = df['var'].rolling('3D', min_periods=3).median().shift(freq=pd.Timedelta('4d')).shift(-1)

df['var'] 
Out[55]: 
timestamp
2011-01-01 00:00:00         NaN
2011-01-01 06:00:00         NaN
2011-01-01 12:00:00         NaN
2011-01-01 18:00:00         NaN
2011-01-02 00:00:00         NaN
2011-01-02 06:00:00         NaN
2011-01-02 12:00:00         NaN
2011-01-02 18:00:00         NaN
2011-01-03 00:00:00         NaN
2011-01-03 06:00:00         NaN
2011-01-03 12:00:00         NaN
2011-01-03 18:00:00         NaN
2011-01-04 00:00:00         NaN
2011-01-04 06:00:00         NaN
2011-01-04 12:00:00         NaN
2011-01-04 18:00:00         NaN
2011-01-05 00:00:00         NaN
2011-01-05 06:00:00   -0.528172
2011-01-05 12:00:00   -0.569964
2011-01-05 18:00:00   -0.528172
2011-01-06 00:00:00   -0.569964
2011-01-06 06:00:00   -0.528172
2011-01-06 12:00:00   -0.569964
2011-01-06 18:00:00   -0.528172
2011-01-07 00:00:00   -0.388771
2011-01-07 06:00:00   -0.249370
2011-01-07 12:00:00   -0.388771
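
Two different kinds of shift are chained in the line above. A minimal sketch of the difference, using a made-up three-point irregular series (the names s, by_time, and by_rows are illustrative only): shift(freq=...) moves the index labels in time and leaves the values alone, while a positional shift(n) moves the values by n rows and leaves the index alone, so it corresponds to a fixed time offset only when the data is evenly spaced.

```python
import pandas as pd

# Illustrative irregular series: note the gap between Jan 2 and Jan 5.
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-05']),
)

# Time-based shift: every index label moves forward by 4 days, values unchanged.
by_time = s.shift(freq=pd.Timedelta('4D'))

# Positional shift: every value moves up by one row, index unchanged.
# The last row has nothing to take its value from, so it becomes NaN.
by_rows = s.shift(-1)

print(by_time)
print(by_rows)
```

With gaps in the data, the positional step moves each value by whatever time distance happens to separate neighbouring rows, which is worth keeping in mind when the real data is irregular.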

As set up, each row of an irregular time series has a window of a different width, which calls for an iterative approach like the one you started with. But if we make the timestamps the index:

# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))

In this case I chose an interval of 12 hours, but it could be whatever is available, including irregular spacing. A modified function with a window for the median, along with an offset (here, a positive Delta looks backwards), gives you the flexibility you wanted:

def GetMedian(df,var='var',window='2D',Delta='3D'):
    for Ti in df.index:
        Vals=df[(df.index < Ti-pd.Timedelta(Delta)) & \
                (df.index > Ti-pd.Timedelta(Delta)-pd.Timedelta(window))]
        df.loc[Ti,'Medians']=Vals[var].median()
    return df

This runs substantially faster:

%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
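
The loop above still builds two boolean masks over the whole index on every iteration. One way to avoid that, sketched here under assumptions rather than taken from the answer, is to locate each row's window boundaries with np.searchsorted and take the median of the resulting slice. The sketch assumes a 'timestamp' column sorted in ascending order; the function name past_window_median and the min_points parameter are hypothetical, and the window bounds (at least `window` days back, strictly, and at most 2*`window` days back) mirror the condition in the question's code.

```python
import numpy as np
import pandas as pd

def past_window_median(df, var='var', window=3, min_points=5):
    """Median of `var` over timestamps in [t - 2*window days, t - window days).

    Hypothetical helper; assumes df['timestamp'] is sorted ascending.
    """
    ts = df['timestamp'].to_numpy(dtype='datetime64[ns]')
    vals = df[var].to_numpy(dtype=float)
    near = np.timedelta64(window, 'D')
    far = np.timedelta64(2 * window, 'D')
    # lo[i]: first row with timestamp >= t_i - far
    # hi[i]: first row with timestamp >= t_i - near (excluded from the slice)
    lo = np.searchsorted(ts, ts - far, side='left')
    hi = np.searchsorted(ts, ts - near, side='left')
    out = np.full(len(df), np.nan)
    for i in range(len(df)):
        if hi[i] - lo[i] >= min_points:  # require enough data points
            out[i] = np.median(vals[lo[i]:hi[i]])
    return pd.Series(out, index=df.index)

# Usage on a ramp so the selected rows are easy to read off.
demo = pd.DataFrame({
    'timestamp': pd.date_range('2011-01-01', periods=40, freq=pd.Timedelta(hours=6)),
    'var': np.arange(40.0),
})
result = past_window_median(demo)
print(result.iloc[15:19])
```

The Python loop remains, but each iteration now touches only the rows inside its window, and gaps or several points per day are handled because the boundaries come from the timestamps themselves.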

min_periods should be 2 instead of 5, because the window size should not be counted in (5 - 3 = 2).

import pandas as pd
import numpy as np
import datetime
np.random.seed(1)  # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
# date_range already yields datetime64 values, so no extra cast is needed
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df

%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop

My first try didn't use shift(), but then I saw Noobie's answer.

I made the following one with shift(), which is much faster than the previous one.

def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df

%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop

The second one is around 4 times as fast as the first one.

These two functions produce the same result, which looks like this:

df2 = test()
df2
                  timestamp       var   past4d-6d_var_median
    0   2011-01-01 00:00:00  1.624345                    NaN
    1   2011-01-02 00:00:00 -0.611756                    NaN
    2   2011-01-03 00:00:00 -0.528172                    NaN
    3   2011-01-04 00:00:00 -1.072969                    NaN
    4   2011-01-05 00:00:00  0.865408               0.506294
    5   2011-01-06 00:00:00 -2.301539              -0.528172
    6   2011-01-07 00:00:00  1.744812              -0.611756
    ...         ...            ...             ...
    93  2011-04-04 00:00:00 -0.638730               1.129484
    94  2011-04-05 00:00:00  0.423494               1.129484
    95  2011-04-06 00:00:00  0.077340               0.185156
    96  2011-04-07 00:00:00 -0.343854              -0.375285
    97  2011-04-08 00:00:00  0.043597              -0.375285
    98  2011-04-09 00:00:00 -0.620001               0.077340
    99  2011-04-10 00:00:00  0.698032               0.077340
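
To see which rows feed each output value, a quick check on a ramp series helps, since each value then equals its own row number (the names s, shift3, and shift4 are illustrative only). On evenly spaced daily data, rolling(window=3).median().shift(3) draws on the rows 3 to 5 days back, whereas the condition in the question's code (more than 3 days and at most 6 days back) selects the rows 4 to 6 days back, which would correspond to shift(4).

```python
import numpy as np
import pandas as pd

# Ramp on a daily index: the value in each row equals its row number.
s = pd.Series(np.arange(10.0),
              index=pd.date_range('2011-01-01', periods=10, freq='D'))

shift3 = s.rolling(window=3, min_periods=2).median().shift(3)
shift4 = s.rolling(window=3, min_periods=2).median().shift(4)

# Row 8 with shift(3): median of rows 3, 4, 5 (3 to 5 days back) -> 4.0
print(shift3.iloc[8])
# Row 8 with shift(4): median of rows 2, 3, 4... no: rows 2, 3, 4 are 6, 5, 4 days back -> 3.0
print(shift4.iloc[8])
```

This only holds when there is exactly one row per day; with gaps or several points per day, a positional shift no longer maps to a fixed number of days.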
