
pandas rolling over irregular dataset with irregular window size

I want to calculate a median over an irregular pandas series.

In particular, I want the median over the previous X days and, separately, the median over the following X days.

I wrote the following working example. It generates two columns: median_-2days, the median over the previous two days, and median_+2days, the median over the next two days.

import numpy as np
import pandas as pd


def dummy_data():
    idx = np.array([pd.Timestamp(year=2021, month=1, day=1),
                    pd.Timestamp(year=2021, month=1, day=2),
                    pd.Timestamp(year=2021, month=1, day=3),
                    pd.Timestamp(year=2021, month=1, day=5),
                    pd.Timestamp(year=2021, month=1, day=6),
                    pd.Timestamp(year=2021, month=1, day=8),
                    pd.Timestamp(year=2021, month=1, day=9),
                    pd.Timestamp(year=2021, month=1, day=10),
                    ])
    data = np.array([1, 2, 3, 5, 6, 8, 9, 10])
    return pd.DataFrame(data, index=idx, columns=["l"])


def rolling_median_irregular(ds, left, right):
    res = pd.Series(index=ds.index, dtype=float)  # explicit dtype avoids an object-dtype result
    for t in ds.index:
        val = ds.loc[(ds.index >= t - left) & (ds.index <= t + right)].median()
        res.loc[t] = val
    return res


if __name__ == "__main__":
    df = dummy_data()
    df["median_-2days"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=2), right=pd.Timedelta(days=0))
    df["median_+2days"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=0), right=pd.Timedelta(days=2))

However, I feel like I'm reinventing the wheel. I would prefer to use the built-in rolling function to have a more general approach where I can also use different functions (like .sum() or .mean() ) and different window types.

Is it even possible to do this with the built-in functions, or do I have to subclass BaseIndexer? If so, what would that look like?

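You can use rolling with a window of '3D'. By default an offset window is closed on the right only, i.e. (t - 3D, t], which with daily timestamps selects the same rows as your inclusive >= and <= bounds over two days. To get the forward-looking column, reverse the series with [::-1] before rolling; the result is aligned back on the index when it is assigned: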

df["median_-2days_r"] = df.loc[:,'l'].rolling('3D').median()
df["median_+2days_r"] = df.loc[::-1, 'l'].rolling('3D').median()
print(df)
             l  median_-2days  median_+2days  median_-2days_r  median_+2days_r
2021-01-01   1            1.0            2.0              1.0              2.0
2021-01-02   2            1.5            2.5              1.5              2.5
2021-01-03   3            2.0            4.0              2.0              4.0
2021-01-05   5            4.0            5.5              4.0              5.5
2021-01-06   6            5.5            7.0              5.5              7.0
2021-01-08   8            7.0            9.0              7.0              9.0
2021-01-09   9            8.5            9.5              8.5              9.5
2021-01-10  10            9.0           10.0              9.0             10.0
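On the BaseIndexer part of the question, a custom indexer could look roughly like the sketch below. The class name TimeWindowIndexer and its left/right arguments are made up for illustration, and the code assumes a sorted DatetimeIndex and pandas 1.5 or newer (where get_window_bounds receives a step argument). It is one possible way to express the inclusive window [t - left, t + right], not an official recipe.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class TimeWindowIndexer(BaseIndexer):
    """Hypothetical indexer for the inclusive window [t - left, t + right]."""

    def __init__(self, times, left, right):
        super().__init__()
        self.times = times  # assumed to be a sorted DatetimeIndex
        self.left = pd.Timedelta(left)
        self.right = pd.Timedelta(right)

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        # Position of the first row with timestamp >= t - left.
        start = self.times.searchsorted(self.times - self.left, side="left")
        # One past the last row with timestamp <= t + right.
        end = self.times.searchsorted(self.times + self.right, side="right")
        return start.astype(np.int64), end.astype(np.int64)


idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
                      "2021-01-06", "2021-01-08", "2021-01-09", "2021-01-10"])
s = pd.Series([1, 2, 3, 5, 6, 8, 9, 10], index=idx, dtype=float)

two_days = pd.Timedelta(days=2)
zero = pd.Timedelta(0)

# min_periods=1 so that a window holding a single row still yields a value.
backward = s.rolling(TimeWindowIndexer(idx, two_days, zero), min_periods=1).median()
forward = s.rolling(TimeWindowIndexer(idx, zero, two_days), min_periods=1).median()
print(pd.DataFrame({"l": s, "backward": backward, "forward": forward}))
```

Because the indexer only supplies window bounds, the same rolling object can be followed by .sum() or .mean() instead of .median().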

Edit: given the additional details about duplicate index values and the size of the data, you can try the following per DataFrame:

# sample data: 1000 distinct minute timestamps, each repeated 5 times
np.random.seed(10)
df_ = pd.DataFrame(
        index=np.tile(np.random.choice(
                         # 'min' replaces the deprecated 'T' alias
                         pd.date_range('2021-03-03','2021-03-04',freq='min'),
                         size=1000, replace=False),
                      5),
        data={'l':range(5000)}
        ).sort_index()
#your method for reference
df_['median_func'] = rolling_median_irregular(df_["l"], left=pd.Timedelta(minutes=15), 
                                              right=pd.Timedelta(minutes=0))
#using concat
df_['median_concat'] = (
    pd.concat([df_[['l']].set_index(df_.index+pd.Timedelta(minutes=i)) 
                                              #here put -i for other direction
               for i in range(0,16)])
      .groupby(level=0)
      .median()
    )

So you get

print(df_.head(12))
                        l  median_concat  median_func
2021-03-03 00:01:00  3357         2357.0       2357.0
2021-03-03 00:01:00   357         2357.0       2357.0
2021-03-03 00:01:00  4357         2357.0       2357.0
2021-03-03 00:01:00  1357         2357.0       2357.0
2021-03-03 00:01:00  2357         2357.0       2357.0
2021-03-03 00:03:00  3903         2630.0       2630.0
2021-03-03 00:03:00   903         2630.0       2630.0
2021-03-03 00:03:00  2903         2630.0       2630.0
2021-03-03 00:03:00  1903         2630.0       2630.0
2021-03-03 00:03:00  4903         2630.0       2630.0
2021-03-03 00:06:00  2505         2505.0       2505.0
2021-03-03 00:06:00  3505         2505.0       2505.0
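To see why the shift-and-concat trick works, it can help to replay it on the small daily frame from the question. The sketch below is only an illustration under that assumption; the names small, shifted and median_back are made up for the example.

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
                      "2021-01-06", "2021-01-08", "2021-01-09", "2021-01-10"])
small = pd.DataFrame({"l": [1, 2, 3, 5, 6, 8, 9, 10]}, index=idx)

# Copy the frame three times, moving each copy 0, 1 and 2 days forward.
# A value observed on day d is then filed under d, d+1 and d+2, which are
# exactly the days whose backward-looking 2-day window contains d.
shifted = pd.concat(
    [small.set_index(small.index + pd.Timedelta(days=i)) for i in range(3)]
)

# Group by the shifted timestamp and aggregate. Days that exist only because
# of the shift (2021-01-04, for instance) are dropped by the reindex.
median_back = shifted.groupby(level=0)["l"].median().reindex(small.index)
print(median_back)
```

The approach enumerates every possible offset, so it relies on the timestamps lying on a regular grid (whole days here, whole minutes in the sample above).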

As for timing, the concat version is about 80x faster:

%timeit rolling_median_irregular(df_["l"], left=pd.Timedelta(minutes=15), right=pd.Timedelta(minutes=0))
# 2.12 s ± 74.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat([df_[['l']].set_index(df_.index+pd.Timedelta(minutes=i)) for i in range(0,16)]).groupby(level=0).median()
#25.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Yes, you are reinventing the wheel.

You only need to pass window='2D' to .rolling() to roll over a fixed time period, such as two days (as opposed to a fixed number of observations).
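One caveat worth checking: for offset windows the default is closed='right', so '2D' covers (t - 2D, t] and leaves out an observation that lies exactly two days back. A minimal sketch, assuming the daily sample series from the question, compares that default with closed='both':

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
                      "2021-01-06", "2021-01-08", "2021-01-09", "2021-01-10"])
s = pd.Series([1, 2, 3, 5, 6, 8, 9, 10], index=idx, dtype=float)

# Default for offset windows: closed="right", i.e. the window (t - 2D, t].
half_open = s.rolling("2D").median()

# closed="both" gives [t - 2D, t], matching the >= and <= test in the loop.
inclusive = s.rolling("2D", closed="both").median()

print(pd.DataFrame({"l": s, "half_open": half_open, "inclusive": inclusive}))
```

On this data only the inclusive variant reproduces the median_-2days column; on 2021-01-03, for example, the half-open window misses the value from 2021-01-01.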
