Pandas groupby rolling mean with custom window size
For a Pandas DataFrame, I'm trying to compute a grouped rolling mean with a variable window size that is specified on each row and is relative to a datetime index.
For the following df of weekly data:
| week_start_date | material | location | quantity | window_size |
|-----------------|----------|----------|----------|-------------|
| 2019-01-28 | C | A | 870 | 1 |
| 2019-02-04 | C | A | 920 | 3 |
| 2019-02-18 | C | A | 120 | 1 |
| 2019-02-25 | C | A | 120 | 2 |
| 2019-03-04 | C | A | 120 | 1 |
| 2018-12-31 | D | A | 1200 | 8 |
| 2019-01-21 | D | A | 720 | 8 |
| 2019-01-28 | D | A | 480 | 8 |
| 2019-02-04 | D | A | 600 | 8 |
| 2019-02-11 | D | A | 720 | 8 |
| 2019-02-18 | D | A | 80 | 8 |
| 2019-02-25 | D | A | 600 | 8 |
| 2019-03-04 | D | A | 1200 | 8 |
| 2019-01-14 | E | B | 150 | 1 |
| 2019-01-28 | E | B | 1416 | 3 |
| 2019-02-04 | F | B | 1164 | 1 |
| 2019-01-28 | G | B | 11520 | 8 |
The window needs to be relative to the actual dates in week_start_date, rather than treating the index like an integer position.
It needs to be grouped by material and location.
The rolling mean is computed over the quantity column.
The window size needs to vary with the value in the window_size column. This value changes over time; it represents the number of weeks back in time over which quantity needs to be aggregated.
When a week-dated row isn't available, the mean should assume that week's value is 0. That is, a plain rolling mean gives mean(null, null, null, 1000) = 1000, but it should actually give mean(0, 0, 0, 1000) = 250. However, this should only apply after the first observation has been measured.
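The zero-fill requirement can be illustrated on a single group. This is only a minimal sketch with made-up values (the series `s` is hypothetical, not taken from the data above), and it assumes the dates within a group are unique and fall on a regular 7-day grid. `asfreq` starts at the first observation, so no zeros are added before it.

```python
import pandas as pd

# Hypothetical single-group series: two observations with a three-week gap.
s = pd.Series(
    [1000.0, 1000.0],
    index=pd.to_datetime(["2019-01-07", "2019-02-04"]),
)

# Reindex onto a weekly grid from the first to the last observation,
# treating the missing weeks as 0 instead of leaving them out.
weekly = s.asfreq("7D", fill_value=0)

# 4-week rolling mean without and with the zero fill.
sparse_mean = s.rolling("28D").mean()       # missing weeks are simply ignored
filled_mean = weekly.rolling("28D").mean()  # missing weeks count as 0

print(sparse_mean.iloc[-1])  # mean(1000)
print(filled_mean.iloc[-1])  # mean(0, 0, 0, 1000)
```

With a time-based window the default min_periods is 1, so the early weeks are averaged over however many weeks have been seen so far.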
I can get a static window of 8 weeks (56 days) using the following:
df.set_index('week_start_date').groupby(['material', 'location'])['quantity'].rolling('56D', min_periods=1).mean()
I've explored the use of expanding but haven't been successful.
How can the window size be set relative to each row it reads?
import pandas as pd
# Example data
df = pd.DataFrame({'week_start_date': ['2019-01-28','2019-02-04','2019-02-18','2019-02-25','2019-03-04','2018-12-31','2019-01-21','2019-01-28','2019-02-04','2019-02-11','2019-02-18','2019-02-25','2019-03-04','2019-01-14','2019-01-28','2019-02-04','2019-01-28'],
                   'material': ['C','C','C','C','C','D','D','D','D','D','D','D','D','E','E','F','G'],
                   'location': ['A','A','A','A','A','A','A','A','A','A','A','A','A','B','B','B','B'],
                   'quantity': ['870','920','120','120','120','1200','720','480','600','720','80','600','1200','150','1416','1164','11520'],
                   # named window_size to match the table above
                   'window_size': ['1','3','1','2','1','8','8','8','8','8','8','8','8','1','3','1','8']})
# Fix formats: parse the dates and convert quantity to float in place
df['week_start_date'] = pd.to_datetime(df['week_start_date'])
df['quantity'] = df['quantity'].astype(float)
Expected output:
| material | location | week_start_date | quantity |
|----------|----------|-----------------|----------|
| C | A | 2019-01-28 | 870 |
| C | A | 2019-04-02 | 306.6667 |
| C | A | 2019-02-18 | 520 |
| C | A | 2019-02-25 | 386.6667 |
| D | A | 2018-12-31 | 1200 |
| D | A | 2019-01-21 | 960 |
| D | A | 2019-01-28 | 800 |
| D | A | 2019-04-02 | 600 |
| D | A | 2019-11-02 | 720 |
| D | A | 2019-02-18 | 400 |
| D | A | 2019-02-25 | 466.6667 |
| D | A | 2019-04-03 | 650 |
| E | B | 2019-01-14 | 150 |
| E | B | 2019-01-28 | 783 |
| F | B | 2019-04-02 | 1164 |
| G | B | 2019-01-28 | 11520 |
A naive way to do this is to run the 8 calculations (assuming the window size is bounded!) and merge the results:
In [11]: d = {w: df.set_index('week_start_date')
                   .groupby(['material', 'location'])['quantity']
                   .rolling(f'{7*w}D', min_periods=1)
                   .mean()
                   .reset_index(name="mean")
                   .assign(window_size=w)
              for w in range(1, 9)}
Then you can concat these DataFrames together and merge with the original. Since merge joins on every column the two frames share (material, location, week_start_date and window_size), the inner join keeps, for each row, only the mean computed with that row's own window_size.
In [12]: pd.concat(d.values()).merge(df, how="inner")
Out[12]:
material location week_start_date mean window_size quantity
0 C A 2019-01-28 870.000000 1 870.0
1 C A 2019-02-18 520.000000 1 120.0
2 C A 2019-04-03 320.000000 1 120.0
3 E B 2019-01-14 150.000000 1 150.0
4 F B 2019-04-02 1164.000000 1 1164.0
5 C A 2019-02-25 386.666667 2 120.0
6 C A 2019-04-02 920.000000 3 920.0
7 E B 2019-01-28 783.000000 3 1416.0
8 D A 2018-12-31 1200.000000 8 1200.0
9 D A 2019-01-21 960.000000 8 720.0
10 D A 2019-01-28 800.000000 8 480.0
11 D A 2019-04-02 600.000000 8 600.0
12 D A 2019-11-02 720.000000 8 720.0
13 D A 2019-02-18 400.000000 8 80.0
14 D A 2019-02-25 466.666667 8 600.0
15 D A 2019-04-03 650.000000 8 1200.0
16 G B 2019-01-28 11520.000000 8 11520.0
Note: This assumes you've filled the missing window_size values with 8:
df.window_size = df.window_size.replace('NaN', 8).astype(int) # in your example
Further, you want to pass format to to_datetime so that you don't hit ambiguity. pandas may be able to do a good job of inferring it here, but I wouldn't rely on it (explicitly use format='%d/%m/%Y'). You want to get rid of the weird date formats as soon as you read the data in; this can also be handled in read_csv (dayfirst=True) and friends.
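As a small illustration of the point about explicit formats, here is a sketch that parses day-first strings. The sample strings in `raw` are hypothetical and only assume the data arrives as day/month/year text.

```python
import pandas as pd

# Hypothetical day-first strings; "04/02/2019" is ambiguous without a format
# (4 February vs. 2 April).
raw = pd.Series(["28/01/2019", "04/02/2019", "11/02/2019"])

# An explicit format removes the ambiguity: day, then month, then year.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Once the column is a proper datetime, the time-based rolling windows above operate on the intended weeks.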
I'm not entirely convinced this is what you want, since there is a difference between your input df and the expected output (e.g. there's no GB in the expected...).
Regardless, I suspect there is a single-shot way to do this, but it will depend on the sparsity of the week/material/location data (if it's dense it'll be much easier; if it's sparse, this may be the best bet)...
Now that I think about it, you can do this entirely on the material/location sub-DataFrame. Can you simplify the problem to just a function of that DataFrame (just week + value, ignoring material/location), or will apply be too slow?
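One possible shape for such a per-group function is sketched below. It is not a tested solution to the question and is not claimed to reproduce the expected table. The name `variable_window_mean` and the `rolling_mean` column are made up for illustration. It assumes dates within each group are unique and on a 7-day grid, that quantity is numeric and window_size is an integer number of weeks, and that a window of w weeks includes the current week.

```python
import pandas as pd

def variable_window_mean(group: pd.DataFrame) -> pd.Series:
    """Mean of quantity over each row's own window_size (in weeks)."""
    # Weekly grid from the group's first to last observation; gaps become 0.
    weekly = (
        group.set_index("week_start_date")["quantity"]
        .sort_index()
        .asfreq("7D", fill_value=0)
    )
    means = []
    for date, weeks in zip(group["week_start_date"], group["window_size"]):
        # Window covers the current week plus the (weeks - 1) weeks before it.
        start = date - pd.Timedelta(weeks=int(weeks) - 1)
        # Label slicing is inclusive; weeks before the first observation
        # are not on the grid, so they are not counted as zeros.
        means.append(weekly.loc[start:date].mean())
    return pd.Series(means, index=group.index)

# Small example using the C/A and E/B rows from the question.
df = pd.DataFrame({
    "week_start_date": pd.to_datetime([
        "2019-01-28", "2019-02-04", "2019-02-18", "2019-02-25", "2019-03-04",
        "2019-01-14", "2019-01-28",
    ]),
    "material": ["C", "C", "C", "C", "C", "E", "E"],
    "location": ["A", "A", "A", "A", "A", "B", "B"],
    "quantity": [870.0, 920.0, 120.0, 120.0, 120.0, 150.0, 1416.0],
    "window_size": [1, 3, 1, 2, 1, 1, 3],
})

# Run the function once per group and stitch the pieces back by row index.
parts = [variable_window_mean(g) for _, g in df.groupby(["material", "location"])]
df["rolling_mean"] = pd.concat(parts)
print(df)
```

For the E/B row on 2019-01-28 with a 3-week window, this averages 150, 0 (the missing week of 2019-01-21) and 1416. The Python-level loop over rows will be slow on large frames, so the concat-and-merge approach above may still be preferable when the window size is bounded.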