I want to calculate a median over an irregularly sampled pandas series.
In particular, I want the median once over the previous X days and once over the following X days.
I wrote the following working example. It generates two columns: median_-2days, which holds the median over the previous two days, and median_+2days, which does the same over the next two days.
import numpy as np
import pandas as pd


def dummy_data():
    # Irregular daily index: Jan 4 and Jan 7 are missing on purpose
    idx = np.array([pd.Timestamp(year=2021, month=1, day=1),
                    pd.Timestamp(year=2021, month=1, day=2),
                    pd.Timestamp(year=2021, month=1, day=3),
                    pd.Timestamp(year=2021, month=1, day=5),
                    pd.Timestamp(year=2021, month=1, day=6),
                    pd.Timestamp(year=2021, month=1, day=8),
                    pd.Timestamp(year=2021, month=1, day=9),
                    pd.Timestamp(year=2021, month=1, day=10),
                    ])
    data = np.array([1, 2, 3, 5, 6, 8, 9, 10])
    return pd.DataFrame(data, index=idx, columns=["l"])


def rolling_median_irregular(ds, left, right):
    # Explicit dtype avoids the empty-Series dtype warning
    res = pd.Series(index=ds.index, dtype=float)
    for t in ds.index:
        # Window is [t - left, t + right], both ends inclusive
        val = ds.loc[(ds.index >= t - left) & (ds.index <= t + right)].median()
        res.loc[t] = val
    return res


if __name__ == "__main__":
    df = dummy_data()
    df["median_-2days"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=2), right=pd.Timedelta(days=0))
    df["median_+2days"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=0), right=pd.Timedelta(days=2))
However, I feel like I'm reinventing the wheel. I would prefer to use the built-in rolling function for a more general approach, so that I can also use other aggregations (like .sum() or .mean()) and different window types.
Is it possible to do this with the built-in functions, or do I have to subclass BaseIndexer? If so, what would that look like?
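You can use rolling with a time-based window of '3D'. By default a time-based window is closed only on the right, i.e. (t - 3D, t], so with daily timestamps it covers the same rows as your inclusive >= and <= bounds over two days. For the forward-looking window, reverse the series with [::-1] before rolling: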
df["median_-2days_r"] = df.loc[:,'l'].rolling('3D').median()
df["median_+2days_r"] = df.loc[::-1, 'l'].rolling('3D').median()
print(df)
             l  median_-2days  median_+2days  median_-2days_r  median_+2days_r
2021-01-01   1            1.0            2.0              1.0              2.0
2021-01-02   2            1.5            2.5              1.5              2.5
2021-01-03   3            2.0            4.0              2.0              4.0
2021-01-05   5            4.0            5.5              4.0              5.5
2021-01-06   6            5.5            7.0              5.5              7.0
2021-01-08   8            7.0            9.0              7.0              9.0
2021-01-09   9            8.5            9.5              8.5              9.5
2021-01-10  10            9.0           10.0              9.0             10.0
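If the timestamps are not guaranteed to fall on whole days, a '3D' window no longer matches the inclusive >=/<= loop exactly. A minimal sketch, assuming the sample data from the question and a reasonably recent pandas: a '2D' window with closed='both' should make both endpoints inclusive, and any other aggregation can be swapped in for .median(). The variable names (back, fwd, out) are only for illustration.

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
                      "2021-01-06", "2021-01-08", "2021-01-09", "2021-01-10"])
s = pd.Series([1, 2, 3, 5, 6, 8, 9, 10], index=idx, name="l")

# Backward window [t - 2D, t], both ends inclusive
back = s.rolling("2D", closed="both")
# Forward window [t, t + 2D]: roll over the reversed series
fwd = s[::-1].rolling("2D", closed="both")

out = pd.DataFrame({
    "median_-2days": back.median(),
    "median_+2days": fwd.median().sort_index(),  # restore chronological order
    "sum_-2days": back.sum(),
    "mean_+2days": fwd.mean().sort_index(),
})
print(out)
```

The same two rolling objects serve every aggregation, so switching from median to sum or mean does not require touching the window definition.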
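Edit: given the clarification about duplicate index values and the size of the data, you can try the following approach on each DataFrame. It shifts a copy of the data by every whole-minute offset in the window, so it relies on the timestamps lying on a one-minute grid.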
# sample data: 1000 distinct minute timestamps, each repeated 5 times
np.random.seed(10)
df_ = pd.DataFrame(
    index=np.tile(np.random.choice(
        pd.date_range('2021-03-03', '2021-03-04', freq='min'),
        size=1000, replace=False),
        5),
    data={'l': range(5000)}
).sort_index()

# your method, for reference
df_['median_func'] = rolling_median_irregular(df_["l"], left=pd.Timedelta(minutes=15),
                                              right=pd.Timedelta(minutes=0))

# using concat: one shifted copy per minute offset, then group by timestamp
df_['median_concat'] = (
    pd.concat([df_[['l']].set_index(df_.index + pd.Timedelta(minutes=i))
               # use -i here for the other (forward) direction
               for i in range(0, 16)])
    .groupby(level=0)
    .median()
)
So you get
print(df_.head(12))
                        l  median_concat  median_func
2021-03-03 00:01:00  3357         2357.0       2357.0
2021-03-03 00:01:00   357         2357.0       2357.0
2021-03-03 00:01:00  4357         2357.0       2357.0
2021-03-03 00:01:00  1357         2357.0       2357.0
2021-03-03 00:01:00  2357         2357.0       2357.0
2021-03-03 00:03:00  3903         2630.0       2630.0
2021-03-03 00:03:00   903         2630.0       2630.0
2021-03-03 00:03:00  2903         2630.0       2630.0
2021-03-03 00:03:00  1903         2630.0       2630.0
2021-03-03 00:03:00  4903         2630.0       2630.0
2021-03-03 00:06:00  2505         2505.0       2505.0
2021-03-03 00:06:00  3505         2505.0       2505.0
As for timing, the concat version is about 80x faster on this sample:
%timeit rolling_median_irregular(df_["l"], left=pd.Timedelta(minutes=15), right=pd.Timedelta(minutes=0))
# 2.12 s ± 74.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat([df_[['l']].set_index(df_.index+pd.Timedelta(minutes=i)) for i in range(0,16)]).groupby(level=0).median()
#25.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
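The shifting trick can be wrapped in a small helper so that the direction and the aggregation become parameters. This is only a sketch: the function name shifted_window_agg and its arguments are made up for illustration, and it assumes every timestamp lies on a whole-minute grid (otherwise the shifted copies never line up).

```python
import pandas as pd


def shifted_window_agg(s, minutes, agg="median", forward=False):
    """Hypothetical helper: aggregate s over [t - minutes, t],
    or over [t, t + minutes] when forward=True.

    Assumes all timestamps fall on whole minutes.
    """
    sign = -1 if forward else 1
    # One copy of the values per minute offset, relabelled with a shifted timestamp
    shifted = pd.concat(
        [pd.Series(s.to_numpy(), index=s.index + sign * pd.Timedelta(minutes=i))
         for i in range(minutes + 1)]
    )
    # Every value that lands on a timestamp belongs to that timestamp's window
    per_timestamp = shifted.groupby(level=0).agg(agg)
    # Broadcast back onto the original (possibly duplicated) index
    return per_timestamp.reindex(s.index)


idx = pd.to_datetime(["2021-03-03 00:01", "2021-03-03 00:01",
                      "2021-03-03 00:03", "2021-03-03 00:20"])
s = pd.Series([1, 3, 5, 7], index=idx)

backward_median = shifted_window_agg(s, 15)
forward_median = shifted_window_agg(s, 15, forward=True)
backward_sum = shifted_window_agg(s, 15, agg="sum")
```

Memory grows with the number of offsets in the window, so this trades space for speed; for long windows at fine resolution a different approach is likely the better fit.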
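Yes, you are reinventing the wheel.
You only need to pass window='2D' to rolling to roll over a fixed time period, such as two days, as opposed to a fixed number of observations. Note that a time-based window is closed only on the right by default, so '2D' leaves out a row that lies exactly two days earlier; pass closed='both' if that row should be included.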
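On the BaseIndexer part of the question: a custom indexer is one way to get both directions with inclusive bounds, without reversing the series. A minimal sketch, assuming pandas 1.5 or later (where get_window_bounds takes a step argument) and a sorted, timezone-naive DatetimeIndex; the class name TimeSpanIndexer is hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class TimeSpanIndexer(BaseIndexer):
    """Hypothetical indexer: window [t - left, t + right], both ends inclusive."""

    def __init__(self, index, left, right):
        super().__init__()
        self.times = index.to_numpy()        # sorted datetime64 values
        self.left = left.to_timedelta64()
        self.right = right.to_timedelta64()

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        t = self.times
        # First row with time >= t - left
        start = np.searchsorted(t, t - self.left, side="left")
        # One past the last row with time <= t + right
        end = np.searchsorted(t, t + self.right, side="right")
        return start.astype(np.int64), end.astype(np.int64)


idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
                      "2021-01-06", "2021-01-08", "2021-01-09", "2021-01-10"])
s = pd.Series([1, 2, 3, 5, 6, 8, 9, 10], index=idx)

two_days, zero = pd.Timedelta(days=2), pd.Timedelta(days=0)
# min_periods=1 because every window contains at least the row itself
prev_median = s.rolling(TimeSpanIndexer(s.index, two_days, zero), min_periods=1).median()
next_median = s.rolling(TimeSpanIndexer(s.index, zero, two_days), min_periods=1).median()
```

Because the bounds come from searchsorted on the timestamps, rows that share a timestamp all fall into each other's windows, which is the same behaviour as the explicit loop in the question.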