How to apply a cumulative custom aggregation function with group by in Pandas

Question

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'model': ['A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1'],
                   'y_true': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'y_pred': [0, 1, 5, 5, 7, 8, 8, 12, 8, 7, 14, 15],
                   'week': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})

   model  y_true  y_pred  week
0   A0       1       0     1
1   A0       2       1     1
2   A1       3       5     1
3   A1       3       5     1
4   A0       4       7     2
5   A0       5       8     2
6   A1       6       8     2
7   A1       7      12     2
8   A0       8       8     3
9   A0       9       7     3
10  A1      10      14     3
11  A1      11      15     3

I want to compute some metrics with sklearn, so I wrote this function:

from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score
import numpy as np

def metrics(df):
    # Return (MAE, MSE, explained variance) for the rows in df
    y_true = np.asarray(df['y_true'])
    y_pred = np.asarray(df['y_pred'])
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    evs = explained_variance_score(y_true, y_pred)
    return mae, mse, evs

I tried grouping like this:

df.groupby(['model', 'week']).apply(metrics)

It returns the metrics for each week separately, but I want them to be cumulative from week 1 onward. That is:

1. For the results of week 1 I want the metrics of y_true and y_pred where the column week takes the value 1.
2. For the results of week 2 I want the metrics of y_true and y_pred where the column week takes the values 1 or 2
3. For the results of week 3 I want the metrics of y_true and y_pred where the column week takes the values 1, 2 or 3
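
The three rules above can be sketched directly as a filter on `week <= w` inside each model. This is only an illustration of the requirement, not code from the original post: `cumulative_metrics` is a hypothetical helper name, and it recomputes each metric from scratch on the growing slice, which is simple but not the most efficient option.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score

df = pd.DataFrame({
    'model': ['A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1'],
    'y_true': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'y_pred': [0, 1, 5, 5, 7, 8, 8, 12, 8, 7, 14, 15],
    'week': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
})

def cumulative_metrics(df):
    """Hypothetical helper: metrics over all rows with week <= w, per model."""
    rows = []
    for model, grp in df.groupby('model'):
        for week in sorted(grp['week'].unique()):
            # Every row of this model seen up to and including this week
            seen = grp[grp['week'] <= week]
            rows.append({
                'model': model,
                'week': int(week),
                'mae': mean_absolute_error(seen['y_true'], seen['y_pred']),
                'mse': mean_squared_error(seen['y_true'], seen['y_pred']),
                'evs': explained_variance_score(seen['y_true'], seen['y_pred']),
            })
    return pd.DataFrame(rows).set_index(['model', 'week'])

result = cumulative_metrics(df)
print(result)
```

Because the slice is taken from the per-model group, values from one model never leak into another model's metrics.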

I have a partial solution, but it keeps accumulating across models, which is not what I want:

              y_true    y_pred                              y_true_cum  \
model week                                                               
A0    1       [1, 2]    [0, 1]                                  [1, 2]   
      2       [4, 5]    [7, 8]                            [1, 2, 4, 5]   
      3       [8, 9]    [8, 7]                      [1, 2, 4, 5, 8, 9]   
A1    1       [3, 3]    [5, 5]                [1, 2, 4, 5, 8, 9, 3, 3]   
      2       [6, 7]   [8, 12]          [1, 2, 4, 5, 8, 9, 3, 3, 6, 7]   
      3     [10, 11]  [14, 15]  [1, 2, 4, 5, 8, 9, 3, 3, 6, 7, 10, 11]

I want each model to accumulate over its own weeks only:

              y_true    y_pred                              y_true_cum  \
model week                                                               
A0    1       [1, 2]    [0, 1]                                  [1, 2]   
      2       [4, 5]    [7, 8]                            [1, 2, 4, 5]   
      3       [8, 9]    [8, 7]                      [1, 2, 4, 5, 8, 9]   
A1    1       [3, 3]    [5, 5]                                  [3, 3]   
      2       [6, 7]   [8, 12]                            [3, 3, 6, 7]   
      3     [10, 11]  [14, 15]                    [3, 3, 6, 7, 10, 11]
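
One way to build the per-model cumulative lists shown above, sketched with plain Python lists so it does not depend on how pandas handles `cumsum` on list-valued columns. The variable names (`cum`, `seen_true`, `seen_pred`) are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'model': ['A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1'],
    'y_true': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'y_pred': [0, 1, 5, 5, 7, 8, 8, 12, 8, 7, 14, 15],
    'week': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
})

rows = []
for model, grp in df.groupby('model'):
    # Start from empty lists for every model so nothing carries over
    seen_true, seen_pred = [], []
    for week, wk in grp.groupby('week'):  # groupby yields weeks in sorted order
        # Build new lists (rather than extending in place) so each row keeps its own snapshot
        seen_true = seen_true + wk['y_true'].tolist()
        seen_pred = seen_pred + wk['y_pred'].tolist()
        rows.append({'model': model, 'week': week,
                     'y_true_cum': seen_true, 'y_pred_cum': seen_pred})

cum = pd.DataFrame(rows).set_index(['model', 'week'])
print(cum)
```

Resetting the lists at the start of each model is what prevents A1 from inheriting A0's values.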

Answer 1

This should do it:

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score

df = pd.DataFrame({
    'model': ['A0', 'A0', 'A1', 'A1','A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1'],
    'week': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'y_true': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'y_pred': [0, 1, 5, 5, 7, 8, 8, 12, 8, 7, 14, 15]
})

def metrics(df):
    df['mae'] = mean_absolute_error(df.y_true, df.y_pred)
    df['mse'] = mean_squared_error(df.y_true, df.y_pred)
    df['evs'] = explained_variance_score(df.y_true, df.y_pred)
    return df


# groupby model, week and keep all values of y_true/y_pred as lists
df_group = df.groupby(['model', 'week']).agg(list)

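# accumulate values for y_true and y_pred within each model
# (cumsum on list-valued columns concatenates the lists; note the double brackets for column selection)
df_group = df_group.groupby('model', group_keys=False)[['y_true', 'y_pred']].apply(lambda x: x.cumsum())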

# apply metrics to new columns
df_group.apply(metrics, axis=1)

Answer 2

In addition to RubenB's answer: a small modification of his code does what is asked.

Starting after this step:

df_group = df.groupby(['model', 'week']).agg(lambda x: list(x))

We can apply cumsum separately for each model:

for col in ['y_true','y_pred']:
    df_group[f'{col}_cum'] = None
df_group = df_group.reset_index().set_index('model') #this is for convenience
for col in ['y_true','y_pred']:
    for model in df_group.index: #now we do this once per model
        df_group.loc[model,f'{col}_cum'] = df_group.loc[model,col].cumsum()

And finally, as RubenB did (note that metrics then has to read the y_true_cum and y_pred_cum columns to give cumulative results):

df_group.apply(metrics, axis=1)

An attempt without the extra loop, though it turns into a messy lambda function:

df_group = df.groupby(['model', 'week']).agg(lambda x: list(x))
df_group = df_group.reset_index()
for col in ['y_true','y_pred']:
    df_group[f'{col}_cum'] = df_group.apply(lambda x:
         df_group.loc[(df_group.model==x.model)&(df_group.week<=x.week),col].sum(),axis=1)

And finally:

df_group.set_index(['model','week']).apply(metrics, axis=1)
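
As a side note, MAE and MSE are plain averages of per-row errors, so their cumulative versions can also be obtained without list columns: sum the errors and row counts per (model, week), take a running sum within each model, and divide. The sketch below assumes only those two metrics are needed; explained variance does not reduce to a single running average this way, so it is left out. The column names `abs_err`, `sq_err` and `n` are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'model': ['A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1', 'A0', 'A0', 'A1', 'A1'],
    'week': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'y_true': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'y_pred': [0, 1, 5, 5, 7, 8, 8, 12, 8, 7, 14, 15],
})

err = df['y_true'] - df['y_pred']

# Per (model, week): total absolute error, total squared error, row count
per_week = (
    df.assign(abs_err=err.abs(), sq_err=err ** 2, n=1)
      .groupby(['model', 'week'])[['abs_err', 'sq_err', 'n']]
      .sum()
)

# Running totals within each model (the grouped index is already sorted by week)
running = per_week.groupby(level='model').cumsum()

cum_metrics = pd.DataFrame({
    'mae': running['abs_err'] / running['n'],
    'mse': running['sq_err'] / running['n'],
})
print(cum_metrics)
```

This avoids storing lists in cells and touches each row only once, at the cost of working only for metrics that decompose into sums.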

Answer 1
1, accepted, 2020-01-28 05:07:25

Answer 2
0, 2020-01-28 11:45:16
