
1-year rolling mean in pandas on a date column

I would like to compute the 1-year rolling average for each row in this DataFrame, test:

index   id      date        variation
2313    7034    2018-03-14  4.139148e-06
2314    7034    2018-03-13  4.953194e-07
2315    7034    2018-03-12  2.854749e-06
2316    7034    2018-03-09  3.907458e-06
2317    7034    2018-03-08  1.662412e-06
2318    7034    2018-03-07  1.346433e-06
2319    7034    2018-03-06  8.731700e-06
2320    7034    2018-03-05  7.145597e-06
2321    7034    2018-03-02  4.893283e-06
...

For example, I would need to calculate:

  • mean of variation of id 7034 between 2018-03-14 and 2017-03-14
  • mean of variation of id 7034 between 2018-03-13 and 2017-03-13
  • etc.

I tried:

test.groupby(['id','date'])['variation'].rolling(window=1,freq='Y',on='date').mean()

but I got the error message:

ValueError: invalid on specified as date, must be a column (if DataFrame) or None

How can I use the pandas rolling() function in this case?


[EDIT 1] [thanks to Sacul]

I tested:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date').groupby('id').rolling(window=1, freq='Y').mean()['variation']

But freq='Y' doesn't work (I got ValueError: Invalid frequency: Y), so I used window=365, freq='D' instead.

But there is another issue: because there are never 365 consecutive dates for any id-date combination, the result is always empty. Even if there are missing dates, I would like to ignore them and use all dates between the current date and (current date - 365 days) to compute the rolling mean. For instance, imagine I have:

index   id      date        variation
2313    7034    2018-03-14  4.139148e-06
2314    7034    2018-03-13  4.953194e-07
2315    7034    2017-03-13  2.854749e-06

Then,

  • for 7034 2018-03-14: I would like to compute MEAN(4.139148e-06, 4.953194e-07, 2.854749e-06)
  • for 7034 2018-03-13: I would also like to compute MEAN(4.139148e-06, 4.953194e-07, 2.854749e-06)

How can I do that?


[EDIT 2]

Finally, I used the calls below to compute the rolling median, mean, and standard deviation over 1 year, ignoring missing dates:

pd.rolling_median(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)

pd.rolling_mean(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)

pd.rolling_std(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)
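Note that the pd.rolling_* functions were deprecated and later removed from pandas, so the calls above fail on recent versions. A rough modern equivalent, as a sketch only, is a time-based window passed to .rolling() on a datetime index. The data below is made up for illustration (it is not the data from the question), and the window string "365D" is an assumption about what "1 year" should mean here:

```python
import pandas as pd

# Illustrative data (not from the question): two ids, irregular dates.
df = pd.DataFrame({
    "id": [7034, 7034, 7034, 7034, 7035, 7035],
    "date": ["2018-03-14", "2018-03-13", "2017-06-01", "2016-01-04",
             "2018-03-14", "2018-01-02"],
    "variation": [4.0, 2.0, 6.0, 10.0, 1.0, 3.0],
})
df["date"] = pd.to_datetime(df["date"])

# Offset windows such as "365D" need a sorted datetime index.
indexed = df.sort_values("date").set_index("date")

# Each window covers the 365 days ending at the row's own date,
# using whatever rows exist in that span (gaps are simply skipped).
rolled = indexed.groupby("id")["variation"].rolling("365D", min_periods=1)

stats = pd.DataFrame({
    "mean": rolled.mean(),
    "median": rolled.median(),
    "std": rolled.std(),
})
print(stats)
```

The window only looks backward from each row and, by default, excludes its left edge, so a row dated exactly 365 days earlier is not included. If that boundary matters, a slightly wider window (for example "366D") may be closer to what is wanted.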

I believe this should work for you:

# First make sure that `date` is a datetime object:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date').groupby('id').rolling(window=1, freq='A').mean()['variation']

Using pd.DataFrame.rolling with datetimes works well when the date is the index, which is why I used df.set_index('date') (as can be seen in one of the documentation's examples).

I can't really test whether it gives the yearly average on your example dataframe, as there is only one year and only one ID, but it should work.

Arguably Better Solution:

[EDIT] As pointed out by Mihai-Andrei Dinculescu, freq is now a deprecated argument. Here is an alternative (and probably more future-proof) way to do what you're looking for:

df.set_index('date').groupby('id')['variation'].resample('A').mean()

You can take a look at the resample documentation for more details on how this works, and at the pandas documentation on frequency (offset) aliases for the available frequency arguments.
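One thing worth checking before choosing between the two approaches: resampling by year produces one mean per id and calendar year, while a time-based rolling window produces one trailing mean per row. The sketch below contrasts them on made-up data (not the data from the question). It groups on the year extracted with dt.year instead of calling resample, an assumption made here to sidestep the yearly frequency alias, which is spelled 'A' in older pandas releases and 'YE' in newer ones:

```python
import pandas as pd

# Illustrative data (not from the question).
df = pd.DataFrame({
    "id": [7034, 7034, 7034, 7034],
    "date": pd.to_datetime(
        ["2017-06-01", "2017-12-29", "2018-03-13", "2018-03-14"]
    ),
    "variation": [6.0, 8.0, 2.0, 4.0],
})

# Calendar-year buckets: one mean per (id, year).
per_year = df.groupby(["id", df["date"].dt.year])["variation"].mean()

# Trailing 365-day window: one mean per (id, date) row.
trailing = (
    df.sort_values("date")
      .set_index("date")
      .groupby("id")["variation"]
      .rolling("365D")
      .mean()
)

print(per_year)
print(trailing)
```

Here the calendar-year result has two rows (2017 and 2018), whereas the trailing result has four, and the value for 2018-03-14 averages rows from both years. The second form is probably closer to the "mean over the previous year for each row" described in the question.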

