pandas GroupBy and the cumulative mean of previous rows in a group

Question

I have a dataframe that looks like this:

pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
Out[40]: 
   category  order_start  time
0         1            1     1
1         1            2     4
2         1            3     3
3         2            1     6
4         2            2     8
5         2            3    17
6         3            1    14
7         3            2    12
8         3            3    13
9         4            1    16

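I want to create a new column that contains the mean of the previous times within the same category. How can I create it?

The new column should look like this: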

pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]: 
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0    = 1 / 1
2         1            3     3   2.5    = (4+1)/2
3         2            1     6   NaN
4         2            2     8   6.0    = 6 / 1
5         2            3    17   7.0    = (8+6) / 2
6         3            1    14   NaN
7         3            2    12  14.0    = 14 / 1
8         3            3    13  13.0    = (12+14) / 2
9         4            1    16   NaN

Note: for the first row of each category, the mean should be NaN.

Edit: As cs95 pointed out, my question is not exactly the same as the linked question, because an expanding window is needed here.

Answer 1

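"Create a new column that contains the mean of the previous times within the same category" sounds like a good use case for GroupBy.expanding (together with a shift):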

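df['mean'] = (
    # group_keys=False keeps the original index so the result aligns with df
    # (pandas >= 2.0 would otherwise prepend the group key to the index)
    df.groupby('category', group_keys=False)['time']
      .apply(lambda x: x.shift().expanding().mean()))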
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
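
To see why the shift is needed, it can help to trace a single group by hand. The following is a minimal sketch using only the three `time` values of category 1 from the question; the variable names are illustrative.

```python
import pandas as pd

# The three `time` values of category 1 from the question
s = pd.Series([1, 4, 3], name='time')

# shift() moves every value down one row, so each row only "sees" earlier rows
shifted = s.shift()              # NaN, 1.0, 4.0

# expanding().mean() averages everything from the start of the group up to
# and including the current row, skipping NaN
result = shifted.expanding().mean()   # NaN, 1.0, 2.5

print(result)
```

Without the shift, the expanding mean would include the current row's own value, which is not what the question asks for.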

Another way to compute this without apply (chaining two groupby calls):

df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      # to_numpy() assigns by position, so df must already be sorted by category
      .to_numpy())  # replace to_numpy() with `.values` for pd.__version__ < 0.24
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN

As for performance, which of the two is faster really depends on the number and size of your groups.
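
One way to check this on your own data is to time both variants. The sketch below is only a rough harness: `make_frame`, `with_apply`, `with_two_groupbys`, and the group shapes are hypothetical choices for illustration, and the chained variant uses `droplevel(0)` instead of `to_numpy()` so that it does not rely on the frame being sorted.

```python
import timeit

import numpy as np
import pandas as pd


def with_apply(df):
    # Variant 1: shift + expanding mean inside apply
    return (df.groupby('category', group_keys=False)['time']
              .apply(lambda x: x.shift().expanding().mean()))


def with_two_groupbys(df):
    # Variant 2: grouped shift, then grouped expanding mean
    out = (df.groupby('category')['time'].shift()
             .groupby(df['category']).expanding().mean())
    # Drop the group level so the result is indexed like df
    return out.droplevel(0)


def make_frame(n_groups, group_size, seed=0):
    # Synthetic data (an assumption for this sketch, not from the question)
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'category': np.repeat(np.arange(n_groups), group_size),
        'time': rng.integers(1, 100, n_groups * group_size),
    })


# Many small groups versus a few large groups
for n_groups, group_size in [(1000, 5), (5, 1000)]:
    frame = make_frame(n_groups, group_size)
    t_apply = timeit.timeit(lambda: with_apply(frame), number=3)
    t_chain = timeit.timeit(lambda: with_two_groupbys(frame), number=3)
    print(f"{n_groups} groups x {group_size} rows: "
          f"apply={t_apply:.3f}s, chained={t_chain:.3f}s")
```

The absolute numbers will vary by machine and pandas version, so treat the output as a comparison on your setup rather than a general result.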

Answer 2

Inspired by one of my earlier answers, you can first define a function:

def mean_previous(df, Category, Order, Var):
    # Order the dataframe first (note: this sorts the caller's df in place)
    df.sort_values([Category, Order], inplace=True)

    # Sum of Var over earlier orders: the grouped cumulative sum per category
    # minus the cumulative sum within the current (category, order) pair
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()

    # Number of rows in earlier orders: the grouped cumulative count per category
    # minus the cumulative count within the current (category, order) pair
    ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()

    # 0 / 0 on the first order of each category yields NaN, as required
    return csp / ccp

The desired column is then

df['mean'] = mean_previous(df, 'category', 'order_start', 'time')

Performance-wise, I believe it is very fast, since it uses only vectorized groupby operations and no apply.
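
Subtracting the per-(category, order) cumulative values matters when several rows share the same order within a category. The sketch below restates the idea in a non-mutating helper (`mean_previous_sketch` is a hypothetical name, and the `tied` frame is made-up data) to show that tied rows are excluded from each other's mean.

```python
import numpy as np
import pandas as pd


def mean_previous_sketch(df, category, order, var):
    # Non-mutating restatement of the approach above (illustrative only)
    d = df.sort_values([category, order])
    by_cat = d.groupby(category)
    by_cat_order = d.groupby([category, order])

    # Sum and count over rows with a strictly earlier order in the category
    prev_sum = by_cat[var].cumsum() - by_cat_order[var].cumsum()
    prev_count = by_cat.cumcount() - by_cat_order.cumcount()

    # 0 / 0 becomes NaN for rows that have no earlier order
    return prev_sum / prev_count


# Made-up data: rows 1 and 2 share order_start == 2 within category 1
tied = pd.DataFrame({'category': [1, 1, 1, 2],
                     'order_start': [1, 2, 2, 1],
                     'time': [1, 4, 3, 10]})

# Assignment aligns on the index, so the sort inside the helper is harmless
tied['mean'] = mean_previous_sketch(tied, 'category', 'order_start', 'time')
print(tied)
```

Both tied rows get 1.0, the mean of the single earlier row, while the first row of each category gets NaN.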
