Create new average columns based on previous rows in pandas

Question

I have a dataset as below:

import pandas as pd 

df = pd.DataFrame({
        'ID':  ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
        'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25', 
                         '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
        'Delay': [2,-2,0,1,2,9,12,29,0,1],
        'Difference_Date': [0,3,1,14,11,5,3,0,38,8],
        })

I need to create two new columns holding the averages of Delay and Difference_Date over the 30 days up to the previous row's date. The data is customer-based, so it needs to be sorted and grouped by ID.

My expected output is:


    ID  Invoice_Date    Delay   Difference_Date  Avg_Delay   Avg_Difference_Date
27459   2020-06-26       2      0                0.00        0.000000
27459   2020-06-29      -2      3                2.00        0.000000
27459   2020-06-30       0      1                0.00        1.500000
27459   2020-07-14       1      14               0.00        1.333333
27459   2020-07-25       2      11               0.25        4.500000
27459   2020-07-30       9      5                0.60        5.800000
27459   2020-08-02       12     3                4.00        10.000000
48002   2020-05-13       29     0                0.00        0.000000
48002   2020-06-20       0      38               29.00       0.000000
48002   2020-06-28       1      8                0.00        38.000000

Answer 1

You need a rolling approach with a 30-day window ("30D"), followed by a shift so that each row only sees earlier rows (not the row itself):

df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'])
df = df.set_index('Invoice_Date')

# Shift inside each group so that the first row of an ID
# does not inherit the rolling mean of the previous ID.
df[['Avg_Delay', 'Avg_Difference_Date']] = (
    df.groupby('ID')[['Delay', 'Difference_Date']]
    .transform(lambda x: x.rolling('30D').mean().shift())
    .fillna(0)
)

# Rearrange columns to match the expected output exactly:
df = df.reset_index().iloc[:, [1,0] + list(range(2, df.shape[1]+1))]

Output:

      ID Invoice_Date  Delay  Difference_Date  Avg_Delay  Avg_Difference_Date
0  27459   2020-06-26      2                0       0.00             0.000000
1  27459   2020-06-29     -2                3       2.00             0.000000
2  27459   2020-06-30      0                1       0.00             1.500000
3  27459   2020-07-14      1               14       0.00             1.333333
4  27459   2020-07-25      2               11       0.25             4.500000
5  27459   2020-07-30      9                5       0.60             5.800000
6  27459   2020-08-02     12                3       4.00            10.000000
7  48002   2020-05-13     29                0       0.00             0.000000
8  48002   2020-06-20      0               38      29.00             0.000000
9  48002   2020-06-28      1                8       0.00            38.000000
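
The question also asks for the data to be sorted, and a time-based window needs the dates in ascending order within each ID. Here is a minimal sketch of how that could be wrapped up, assuming the column names from the question. The helper name add_trailing_means and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd


def add_trailing_means(frame, value_cols=("Delay", "Difference_Date"), window="30D"):
    """Hypothetical helper: per-ID rolling mean ending at the previous invoice."""
    out = frame.copy()
    out["Invoice_Date"] = pd.to_datetime(out["Invoice_Date"])
    # Time-based rolling needs ascending dates inside each ID.
    out = out.sort_values(["ID", "Invoice_Date"]).reset_index(drop=True)

    for col in value_cols:
        pieces = []
        for _, grp in out.groupby("ID"):
            s = grp.set_index("Invoice_Date")[col]
            # Mean over the window ending at each row, then moved down one row,
            # so each row sees the window that ended at the previous invoice.
            prev = s.rolling(window).mean().shift().fillna(0)
            # Re-attach the positional index so the result lines up with `out`.
            pieces.append(pd.Series(prev.to_numpy(), index=grp.index))
        out[f"Avg_{col}"] = pd.concat(pieces)
    return out


df = pd.DataFrame({
    "ID": ["27459"] * 7 + ["48002"] * 3,
    "Invoice_Date": ["2020-06-26", "2020-06-29", "2020-06-30", "2020-07-14", "2020-07-25",
                     "2020-07-30", "2020-08-02", "2020-05-13", "2020-06-20", "2020-06-28"],
    "Delay": [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
    "Difference_Date": [0, 3, 1, 14, 11, 5, 3, 0, 38, 8],
})

result = add_trailing_means(df)
print(result)
```

For the sample data this should give the same Avg_Delay and Avg_Difference_Date values as the expected output in the question. Note that a "30D" window is open on the left and closed on the right by default, so the row dated 2020-08-02 gets the mean over the invoices from 2020-07-14, 2020-07-25 and 2020-07-30 (the window ending at the previous invoice excludes 2020-06-30), which is (1 + 2 + 9) / 3 = 4.0 for Avg_Delay.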
