Pandas DataFrame MultiIndex groupby rolling operation with missing dates

I have a DataFrame with a MultiIndex whose last level is a date. I am trying to apply a rolling operation to the columns with a specific frequency. As I understand it, the usual pandas approach with a DatetimeIndex would be to call rolling with a frequency string (for example '2D' for a two-day window). Another suggested approach is to resample the DatetimeIndex and then apply rolling with the integer 2. Essentially, I want to group by all index levels except the last one and then tell rolling to use the last level for timedelta-based windows. Below is an example to demonstrate this:

from datetime import datetime
import pandas as pd
multi_index = pd.MultiIndex.from_tuples([
    ("A", datetime(2017, 1, 1)), 
    ("A", datetime(2017, 1, 2)), 
    ("A", datetime(2017, 1, 3)), 
    ("A", datetime(2017, 1, 4)),
    ("B", datetime(2017, 1, 1)),
    ("B", datetime(2017, 1, 3)),
    ("B", datetime(2017, 1, 4))])
df = pd.DataFrame(index=multi_index, data={"colA": [1, 1, 1, 1, 1, 1, 1]})
print(df)
# Group by the first index level and by day on the last level, then take a 2-row rolling sum
df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)]).sum().rolling(2).sum()

The above code does not create a row for (B, datetime(2017, 1, 2)), and so the rolling sums are all 2 (after the initial NaN).

One ugly way to get around this, which really only works if at least one group has all the days, is to unstack, fillna and stack before rolling:

df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)]
).sum().unstack().fillna(0).stack().rolling(2).sum()

Needless to say, this is an ugly hack, slow and error-prone. Is there a nice way to achieve what I need here without extensive manipulation? Ideally, some way to tell the grouper to use the timestamp level, or to fill the missing values itself?

You can use groupby + resample + fillna (requires pandas 0.19.0 or later):

multi_index = pd.MultiIndex.from_tuples([
    ("A", datetime(2017, 1, 1)), 
    ("A", datetime(2017, 1, 2)), 
    ("A", datetime(2017, 1, 3)), 
    ("A", datetime(2017, 1, 4)),
    ("B", datetime(2017, 1, 1)),
    ("B", datetime(2017, 1, 3)),
    ("B", datetime(2017, 1, 4))])
df = pd.DataFrame(index=multi_index, data={"colA": [1, 2, 3, 4, 1, 2, 3]})
print(df)
              colA
A 2017-01-01     1
  2017-01-02     2
  2017-01-03     3
  2017-01-04     4
B 2017-01-01     1
  2017-01-03     2
  2017-01-04     3

# Resample each group to daily frequency, fill missing days with 0, then take a 2-row rolling sum
b = df.groupby(level=0).resample('1D', level=1).sum().fillna(0).rolling(2).sum()
print(b)
              colA
A 2017-01-01   NaN
  2017-01-02   3.0
  2017-01-03   5.0
  2017-01-04   7.0
B 2017-01-01   5.0
  2017-01-02   1.0
  2017-01-03   2.0
  2017-01-04   5.0
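
Note that in the output above the value at (B, 2017-01-01) is 5.0: the integer window runs over the whole frame, so it adds the last row of A (4) to the first row of B (1). If the window should stay inside each group, one possible alternative is a time-based window applied per group, sketched below. The level names "group" and "date" are assumptions added for illustration (the original index is unnamed), and the behaviour described is that of recent pandas versions, not 0.19.

```python
from datetime import datetime

import pandas as pd

multi_index = pd.MultiIndex.from_tuples(
    [
        ("A", datetime(2017, 1, 1)),
        ("A", datetime(2017, 1, 2)),
        ("A", datetime(2017, 1, 3)),
        ("A", datetime(2017, 1, 4)),
        ("B", datetime(2017, 1, 1)),
        ("B", datetime(2017, 1, 3)),
        ("B", datetime(2017, 1, 4)),
    ],
    names=["group", "date"],  # hypothetical names, not in the original post
)
df = pd.DataFrame(index=multi_index, data={"colA": [1, 2, 3, 4, 1, 2, 3]})

# Move the group level into a column so that only the dates remain as the index
flat = df.reset_index(level="group")

# A "2D" offset window looks back two days from each timestamp,
# and groupby keeps every window inside its own group
per_group = flat.groupby("group")["colA"].rolling("2D").sum()
print(per_group)
```

With an offset window no row is created for (B, 2017-01-02), and the value at (B, 2017-01-03) is 2.0 because 2017-01-01 falls outside the two-day window. The first row of each group is its own value and not NaN, since min_periods defaults to 1 for offset windows.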
