
pandas rolling() function with monthly offset

I'm trying to use the rolling() function on a pandas DataFrame with monthly data. However, I dropped some NaN values, so there are now gaps in my time series. As a result, the basic window parameter gives a misleading answer, since it just looks at the previous observation regardless of how far back it is:

import pandas as pd
import numpy as np
import random
dft = pd.DataFrame(np.random.randint(0,10,size=len(dt)),index=dt)
dft.columns = ['value']
dft['value'] = np.where(dft['value'] < 3,np.nan,dft['value'])
dft = dft.dropna()
dft['basic'] = dft['value'].rolling(2).sum()

See, for example, the 2017-08-31 entry, which sums 3.0 and 9.0 even though the previous entry is 2017-03-31.

In [57]: dft.tail()
Out[57]:
            value  basic
2017-02-28    8.0   12.0
2017-03-31    3.0   11.0
2017-08-31    9.0   12.0
2017-10-31    7.0   16.0
2017-11-30    7.0   14.0

The natural solution (I thought) is to use a '2M' offset, but it gives an error:

In [58]: dft['basic2M'] = dft['value'].rolling('2M').sum()
...<output omitted>...
ValueError: <2 * MonthEnds> is a non-fixed frequency

If I switch to a daily offset, I can get it to work, but this seems like an odd workaround:

In [59]: dft['basic32D'] = dft['value'].rolling('32D', min_periods=2).sum()

In [61]: dft.tail()
Out[61]:
            value  basic  basic32D
2017-02-28    8.0   12.0      12.0
2017-03-31    3.0   11.0      11.0
2017-08-31    9.0   12.0       NaN
2017-10-31    7.0   16.0       NaN
2017-11-30    7.0   14.0      14.0
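
One way to see why pandas treats a month as a non-fixed frequency, and why '32D' happens to behave like a one-month lookback on month-end data, is to measure the gaps between consecutive month-ends. This is only a minimal sketch; the date range and variable names are illustrative and not taken from the data above.

```python
import pandas as pd

# Illustrative month-end dates (hypothetical; not the data from the question)
month_ends = pd.date_range("2017-01-31", periods=6, freq=pd.offsets.MonthEnd())

# Number of days between each month-end and the one before it
gaps = month_ends.to_series().diff().dropna().dt.days

print(gaps.tolist())  # [28, 31, 30, 31, 30]
```

With the default right-closed window, '32D' always reaches the previous month-end (at most 31 days back) but never the one before that (at least 59 days back), so combined with min_periods=2 it yields NaN whenever the previous month is missing.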

I also tried converting to a PeriodIndex:

dfp = dft.to_period(freq='M')

but this gives the same error:

dfp['basic2M'] = dfp['value'].rolling('2M').sum()

and this result is very unexpected:

dfp['basic32Dp'] = dfp['value'].rolling('32D', min_periods=2).sum()
In [68]: dfp
Out[68]:
         value  basic  basic32D  basic32Dp
2016-02    9.0    NaN       NaN        NaN
2016-03    3.0   12.0      12.0       12.0
2016-04    7.0   10.0      10.0       19.0
2016-05    3.0   10.0      10.0       22.0
2016-06    4.0    7.0       7.0       26.0
2016-07    7.0   11.0      11.0       33.0
2016-08    3.0   10.0      10.0       36.0
2016-09    9.0   12.0      12.0       45.0
2016-11    5.0   14.0       NaN       50.0
2017-01    4.0    9.0       NaN       54.0
2017-02    8.0   12.0      12.0       62.0
2017-03    3.0   11.0      11.0       65.0
2017-08    9.0   12.0       NaN       74.0
2017-10    7.0   16.0       NaN       81.0
2017-11    7.0   14.0      14.0       88.0

The '32D' offset with the 'M' period index seems to be treated as '32M', perhaps? It appears to just be an expanding sum over the entire series.

Perhaps I'm misunderstanding how to use offsets? Obviously, I could solve this by keeping the NaN in the original value column and just using the window parameter, but offsets seem quite useful.
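
The fallback mentioned here (keeping one row per month, with NaN for the gaps, and using an integer window) can also be applied after the rows have been dropped, by restoring the missing month-ends first. This is a rough sketch under stated assumptions: the series `s` is a hand-made stand-in for the tail of the example data, and passing `pd.offsets.MonthEnd()` to `asfreq` is used here in place of a frequency string.

```python
import pandas as pd

# Hypothetical month-end series with gaps, mirroring the tail of the example
idx = pd.to_datetime(
    ["2017-02-28", "2017-03-31", "2017-08-31", "2017-10-31", "2017-11-30"]
)
s = pd.Series([8.0, 3.0, 9.0, 7.0, 7.0], index=idx, name="value")

# Restore the missing month-ends as NaN so that one row equals one month
full = s.asfreq(pd.offsets.MonthEnd())

# A 2-row window is now a 2-month window; any NaN in the window gives NaN
rolled = full.rolling(2).sum()

# Keep only the rows that exist in the gappy series
result = rolled.reindex(s.index)
print(result)
```

In this sketch only 2017-03-31 and 2017-11-30 get a value, because they are the only rows whose previous month is present.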

For what it's worth, if I generate hourly data with a DatetimeIndex, things seem to work as expected (i.e., a '2D' offset with data every 12 hours gives the correct answer across missing rows).

Here is a function that gives you the rolling sum of a specified number of months. You did not provide the variable 'dt' in your code above, so I just created a list of datetimes (code included).

from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
import numpy as np
import random

def date_range(start_date, end_date, increment, period):
    result = []
    nxt = start_date
    delta = relativedelta(**{period:increment})
    while nxt <= end_date:
        result.append(nxt)
        nxt += delta
    return result

def MonthRollSum(df, offset, sumColumn):
    # df must have a DatetimeIndex
    df2 = df.copy()
    df2.index = df2.index + pd.DateOffset(days = -offset)
    return df2.groupby([df2.index.year, df2.index.month])[sumColumn].sum()

# Added to generate the dt list: 8-hour intervals over 1000 days
start_date = datetime.now()
end_date = start_date + relativedelta(days=1000)
end_date = end_date.replace(hour=19, minute=0, second=0, microsecond=0)
dt = date_range(start_date, end_date, 8, 'hours')

# the following was given by the questioner
dft = pd.DataFrame(np.random.randint(0,10,size=len(dt)),index=dt)
dft.columns = ['value']
dft['value'] = np.where(dft['value'] < 3,np.nan,dft['value'])
dft = dft.dropna()

# Call the solution function
dft = MonthRollSum(dft, 2, 'value')
dft

The results may vary because the initial values are randomly generated:

2021  2     290.0
      3     379.0
      4     414.0
      5     368.0
      6     325.0
      7     405.0
      8     425.0
      9     380.0
      10    393.0
      11    370.0
      12    419.0
2022  1     377.0
      2     275.0
      3     334.0
      4     350.0
      5     395.0
      6     376.0
      7     420.0
      8     419.0
      9     359.0
      10    328.0
      11    394.0
      12    345.0
2023  1     381.0
      2     335.0
      3     352.0
      4     355.0
      5     376.0
      6     350.0
      7     401.0
      8     443.0
      9     394.0
      10    394.0
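
To make it easier to check what MonthRollSum computes, here is a small deterministic sketch. The `toy` frame and its dates and values are made up for illustration. As written, `offset` is passed to `pd.DateOffset(days=...)`, so the index is shifted back by that many days before the values are totalled per calendar (year, month) group.

```python
import pandas as pd

def MonthRollSum(df, offset, sumColumn):
    # Same logic as the function above: shift the index, then sum per month
    df2 = df.copy()
    df2.index = df2.index + pd.DateOffset(days=-offset)
    return df2.groupby([df2.index.year, df2.index.month])[sumColumn].sum()

# Hypothetical input: four observations in March and April 2021
toy = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-15", "2021-04-01"]),
)

# Shifting by 2 days moves the rows to 02-27, 02-28, 03-13 and 03-30
out = MonthRollSum(toy, 2, "value")
print(out)
```

Here the two early-March rows land in February (total 3.0) and the other two land in March (total 7.0), so it is worth confirming that this grouping matches the window you have in mind before relying on it.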

This worked for me, using 30D instead of 1M:

df_px = df_px.set_index(pd.to_datetime(df_px['date']))
df_px['px_avg30d']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling('30D').mean())
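
For anyone who wants to try this, here is a self-contained sketch of the same two lines on made-up data. The column names follow the snippet above, but the rows are hypothetical, and the dates within each stock are assumed to be in ascending order, which a time-based rolling window requires.

```python
import pandas as pd

# Hypothetical prices for two stocks; dates ascend within each stock
df_px = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-20", "2021-03-01", "2021-01-02", "2021-01-10"],
        "stock": ["A", "A", "A", "B", "B"],
        "px": [10.0, 20.0, 30.0, 1.0, 3.0],
    }
)

# Use the parsed dates as the index so that rolling('30D') can use them
df_px = df_px.set_index(pd.to_datetime(df_px["date"]))

# 30-day trailing mean, computed separately for each stock
df_px["px_avg30d"] = df_px.groupby("stock")["px"].transform(
    lambda x: x.rolling("30D").mean()
)
print(df_px[["stock", "px", "px_avg30d"]])
```

The third row for stock A keeps its own price as the average because no other observation for A falls within the preceding 30 days.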
