Pandas groupby rolling mean with custom window size

Problem definition:

For a Pandas DataFrame, I'm trying to compute a grouped rolling mean with a variable window size that is specified on each row and applied relative to a datetime index.

Example:

For the following df of weekly data:

| week_start_date | material | location | quantity | window_size |
|-----------------|----------|----------|----------|-------------|
| 2019-01-28      | C        | A        | 870      | 1           |
| 2019-02-04      | C        | A        | 920      | 3           |
| 2019-02-18      | C        | A        | 120      | 1           |
| 2019-02-25      | C        | A        | 120      | 2           |
| 2019-03-04      | C        | A        | 120      | 1           |
| 2018-12-31      | D        | A        | 1200     | 8           |
| 2019-01-21      | D        | A        | 720      | 8           |
| 2019-01-28      | D        | A        | 480      | 8           |
| 2019-02-04      | D        | A        | 600      | 8           |
| 2019-02-11      | D        | A        | 720      | 8           |
| 2019-02-18      | D        | A        | 80       | 8           |
| 2019-02-25      | D        | A        | 600      | 8           |
| 2019-03-04      | D        | A        | 1200     | 8           |
| 2019-01-14      | E        | B        | 150      | 1           |
| 2019-01-28      | E        | B        | 1416     | 3           |
| 2019-02-04      | F        | B        | 1164     | 1           |
| 2019-01-28      | G        | B        | 11520    | 8           |

The window needs to be relative to the actual date set in week_start_date, rather than treating it like an integer index.

It needs to be grouped by material and location.

The rolling mean is for column quantity.

The window size needs to vary based on the value in the window_size column. This value changes over time; it represents the number of weeks back in time over which quantity needs to be aggregated.

When a row isn't available, the mean should assume that value is 0. I.e. when a week-dated row isn't available, mean(null, null, null, 1000) = 1000, but it should actually be mean(0, 0, 0, 1000) = 250. However, this should only apply after the first observation has been measured.
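As a minimal sketch of that zero-fill rule on one group (the E/B rows; the reindex trick here is an illustration of the requirement, not part of the question):

import pandas as pd

# The E/B rows: observations on 2019-01-14 and 2019-01-28, with the
# week of 2019-01-21 missing
s = pd.Series([150.0, 1416.0], index=pd.to_datetime(['2019-01-14', '2019-01-28']))

# Reindex to a full weekly grid starting at the first observation,
# so missing weeks become explicit zeros
weekly = s.reindex(pd.date_range(s.index.min(), s.index.max(), freq='7D'), fill_value=0)

# A 3-week mean at 2019-01-28 is now (150 + 0 + 1416) / 3 = 522,
# rather than (150 + 1416) / 2 = 783
print(weekly.rolling(3, min_periods=1).mean())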

Fixed window, relative to date column:

I can get a static window of 8 weeks (56 days) using the following:

df.set_index('week_start_date').groupby(['material', 'location'])['quantity'].rolling('56D', min_periods=1).mean()

I've explored use of expanding but haven't been successful.

How can the window size be set relative to each row it reads?

Sample Data:

# Example data
import pandas as pd

df = pd.DataFrame({'week_start_date': ['2019-01-28','2019-02-04','2019-02-18','2019-02-25','2019-03-04','2018-12-31','2019-01-21','2019-01-28','2019-02-04','2019-02-11','2019-02-18','2019-02-25','2019-03-04','2019-01-14','2019-01-28','2019-02-04','2019-01-28'],
'material': ['C','C','C','C','C','D','D','D','D','D','D','D','D','E','E','F','G'],
'location': ['A','A','A','A','A','A','A','A','A','A','A','A','A','B','B','B','B'],
'quantity': ['870','920','120','120','120','1200','720','480','600','720','80','600','1200','150','1416','1164','11520'],
'window_size': ['1','3','1','2','1','8','8','8','8','8','8','8','8','1','3','1','8']})
# Fix formats: parse the dates and make the numeric columns numeric,
# so the rolling mean above doesn't choke on string quantities
df['week_start_date'] = pd.to_datetime(df['week_start_date'])
df['quantity'] = df['quantity'].astype(float)
df['window_size'] = df['window_size'].astype(int)

Expected result:

| material | location | week_start_date | mean     |
|----------|----------|-----------------|----------|
| C        | A        | 2019-01-28      | 870      |
| C        | A        | 2019-04-02      | 306.6667 |
| C        | A        | 2019-02-18      | 520      |
| C        | A        | 2019-02-25      | 386.6667 |
| D        | A        | 2018-12-31      | 1200     |
| D        | A        | 2019-01-21      | 960      |
| D        | A        | 2019-01-28      | 800      |
| D        | A        | 2019-04-02      | 600      |
| D        | A        | 2019-11-02      | 720      |
| D        | A        | 2019-02-18      | 400      |
| D        | A        | 2019-02-25      | 466.6667 |
| D        | A        | 2019-04-03      | 650      |
| E        | B        | 2019-01-14      | 150      |
| E        | B        | 2019-01-28      | 783      |
| F        | B        | 2019-04-02      | 1164     |
| G        | B        | 2019-01-28      | 11520    |

A naive way you might do this is to do the 8 (assuming this is bounded!) calculations and merge the results:

In [11]: d = {w: df.set_index('week_start_date')
                   .groupby(['material', 'location'])['quantity']
                   .rolling(f'{7*w}D', min_periods=1)
                   .mean()
                   .reset_index(name="mean")
                   .assign(window_size=w)
              for w in range(1, 9)}

Then you can concat these DataFrames together and merge with the original; since the window_size column is present in both left and right, the inner merge will match on it (along with the other shared columns).

In [12]: pd.concat(d.values()).merge(df, how="inner")
Out[12]:
   material location week_start_date          mean  window_size  quantity
0         C        A      2019-01-28    870.000000            1     870.0
1         C        A      2019-02-18    520.000000            1     120.0
2         C        A      2019-04-03    320.000000            1     120.0
3         E        B      2019-01-14    150.000000            1     150.0
4         F        B      2019-04-02   1164.000000            1    1164.0
5         C        A      2019-02-25    386.666667            2     120.0
6         C        A      2019-04-02    920.000000            3     920.0
7         E        B      2019-01-28    783.000000            3    1416.0
8         D        A      2018-12-31   1200.000000            8    1200.0
9         D        A      2019-01-21    960.000000            8     720.0
10        D        A      2019-01-28    800.000000            8     480.0
11        D        A      2019-04-02    600.000000            8     600.0
12        D        A      2019-11-02    720.000000            8     720.0
13        D        A      2019-02-18    400.000000            8      80.0
14        D        A      2019-02-25    466.666667            8     600.0
15        D        A      2019-04-03    650.000000            8    1200.0
16        G        B      2019-01-28  11520.000000            8   11520.0

Note: This assumes you've set the fillna of window_size to 8:

df.window_size = df.window_size.replace('NaN', 8).astype(int)  # in your example

Further, you want to ensure you pass format to to_datetime to avoid ambiguity; pandas may be able to do a good job of inferring it... but I wouldn't rely on that (use format='%d/%m/%Y' explicitly). You want to get rid of the weird date formats as soon as you read the data in; this can also be handled in read_csv (dayfirst=True) and friends.
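A hedged illustration of that advice (the CSV filename is a placeholder):

import pandas as pd

# Parse dd/mm/yyyy explicitly rather than letting pandas infer per value
df['week_start_date'] = pd.to_datetime(df['week_start_date'], format='%d/%m/%Y')

# Or resolve the ambiguity at load time
df = pd.read_csv('data.csv', parse_dates=['week_start_date'], dayfirst=True)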


I'm not entirely convinced this is what you want, since there are differences between your input df and the expected result (e.g. some expected dates, like 2019-04-02, don't appear in the input...).

Regardless, I suspect there is a single-shot way to do this, but it will depend on the sparsity of the week/material/location combinations (if the data is dense it'll be much easier; if it's sparse this may be the best bet)...

Now that I think about it, you can do this completely on the material/location sub-DataFrame. Can you simplify this problem to just be a function of that DataFrame (just week + value, ignoring material/location), or would that apply be too slow?
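For what it's worth, here is a rough sketch of that per-group idea. It assumes the same trailing-window semantics as the rolling calls above (a window of 7 * window_size days ending at each row, averaging only the rows that exist, as with min_periods=1); treat it as an illustration rather than a tested answer:

import pandas as pd

def variable_window_mean(g: pd.DataFrame) -> pd.Series:
    # For each row, average the quantities whose week_start_date falls
    # inside that row's own trailing window
    g = g.sort_values('week_start_date')
    means = []
    for _, row in g.iterrows():
        end = row['week_start_date']
        # mirror rolling(f'{7*w}D', closed='right'): the window covers
        # the 7*w - 1 days before `end`, plus `end` itself
        start = end - pd.Timedelta(days=7 * int(row['window_size']) - 1)
        in_window = g.loc[g['week_start_date'].between(start, end), 'quantity']
        means.append(in_window.mean())
    return pd.Series(means, index=g.index)

df['variable_mean'] = (df.groupby(['material', 'location'], group_keys=False)
                         .apply(variable_window_mean))

The Python-level loop is what makes the dense case slow, which is exactly the sparsity trade-off above.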
