How to get the cumulative maximum by group for previous rows?

I am trying to compute the p_max_stat column in this table:

+-----+------+--------+------------+
| id_ | p_id | p_stat | p_max_stat |
+-----+------+--------+------------+
|   1 |    1 |      1 | NaN        |
|   2 |    1 |      2 | 1          |
|   3 |    1 |      3 | 2          |
|   4 |    1 |      4 | 3          |
|   5 |    1 |      3 | 4          |
|   6 |    1 |      2 | 4          |
|   1 |    2 |      0 | NaN        |
|   2 |    2 |      0 | 0          |
|   3 |    2 |      0 | 0          |
|   4 |    2 |      0 | 0          |
|   5 |    2 |      0 | 0          |
|   6 |    2 |      0 | 0          |
+-----+------+--------+------------+

where p_max_stat is the maximum of p_stat over the previous rows within each p_id group.

This is what I have so far:

import numpy as np
import pandas as pd

data = {
    'id_': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 1, 7: 2, 8: 3, 9: 4, 10: 5, 11: 6},
    'p_id': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 2, 7: 1, 8: 2, 9: 1, 10: 2, 11: 1},
    'p_stat': {0: 1, 1: 0, 2: 3, 3: 0, 4: 3, 5: 0, 6: 0, 7: 2, 8: 0, 9: 4, 10: 0, 11: 2},
    # np.nan rather than np.NaN: the capitalized alias was removed in NumPy 2.0
    'p_max_stat': {0: np.nan, 1: 0.0, 2: 2.0, 3: 0.0, 4: 4.0, 5: 0.0, 6: np.nan, 7: 1.0, 8: 0.0, 9: 3.0, 10: 0.0, 11: 4.0}
}
df = pd.DataFrame(data)
df.sort_values(["p_id", "id_"], inplace=True)
df["p_max_stat_incorrect"] = (
    df
    .groupby(["p_id"])["p_stat"]
    .shift()
    .cummax()
)

This gives me the correct values for p_id == 1, but incorrect values for p_id == 2:

+-----+------+--------+------------+----------------------+
| id_ | p_id | p_stat | p_max_stat | p_max_stat_incorrect |
+-----+------+--------+------------+----------------------+
|   1 |    1 |      1 | NaN        | NaN                  |
|   2 |    1 |      2 | 1          | 1                    |
|   3 |    1 |      3 | 2          | 2                    |
|   4 |    1 |      4 | 3          | 3                    |
|   5 |    1 |      3 | 4          | 4                    |
|   6 |    1 |      2 | 4          | 4                    |
|   1 |    2 |      0 | NaN        | NaN                  |
|   2 |    2 |      0 | 0          | 4                    |
|   3 |    2 |      0 | 0          | 4                    |
|   4 |    2 |      0 | 0          | 4                    |
|   5 |    2 |      0 | 0          | 4                    |
|   6 |    2 |      0 | 0          | 4                    |
+-----+------+--------+------------+----------------------+

Where did I go wrong?

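groupby shift returns a plain Series, so the cummax call that follows is Series.cummax rather than GroupBy.cummax. The function is therefore applied to the whole column instead of within each group.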
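The point above can be seen on a tiny example. This is only a minimal sketch: the frame `demo` and its values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical two-group frame, used only to illustrate the behaviour.
demo = pd.DataFrame({
    "p_id": [1, 1, 1, 2, 2, 2],
    "p_stat": [1, 4, 2, 0, 0, 0],
})

# shift() is applied per group, but what comes back is an ordinary Series.
shifted = demo.groupby("p_id")["p_stat"].shift()
shifted_type = type(shifted).__name__

# Series.cummax runs down the whole column, so the maximum of group 1
# carries over into the rows of group 2.
leaky = shifted.cummax().tolist()

# Grouping the shifted values again keeps the running maximum inside each group.
per_group = shifted.groupby(demo["p_id"]).cummax().tolist()

print(shifted_type)
print(leaky)
print(per_group)
```

Comparing the two printed lists shows where the values for the second group diverge.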

To fix this, we can use two groupby calls:

df["p_max_stat"] = (
    df
        .groupby(["p_id"])["p_stat"]
        .shift()
        .groupby(df["p_id"])
        .cummax()
)

Or use groupby apply to apply both Series methods together:

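df["p_max_stat"] = (
    # group_keys=False keeps the original index; since pandas 2.0 the default
    # would prepend p_id to the index and the assignment would no longer align
    df.groupby("p_id", group_keys=False)["p_stat"].apply(lambda s: s.shift().cummax())
)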

df

 id_  p_id  p_stat  p_max_stat
   1     1       1         NaN
   2     1       2         1.0
   3     1       3         2.0
   4     1       4         3.0
   5     1       3         4.0
   6     1       2         4.0
   1     2       0         NaN
   2     2       0         0.0
   3     2       0         0.0
   4     2       0         0.0
   5     2       0         0.0
   6     2       0         0.0
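As a sanity check, the output above can be compared programmatically with the expected column from the question. The sketch below also tries `transform` as a further variant; that variant is an assumption of this example and not part of the answer, chosen because `transform` returns a result aligned to the original index. The names `two_groupby`, `via_transform`, and `expected` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same setup as in the answer.
df = pd.DataFrame({
    "id_": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "p_id": [1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1],
    "p_stat": [1, 0, 3, 0, 3, 0, 0, 2, 0, 4, 0, 2],
}).sort_values(["p_id", "id_"])

# Variant from the answer: shift within groups, then regroup for cummax.
two_groupby = (
    df.groupby("p_id")["p_stat"].shift().groupby(df["p_id"]).cummax()
)

# Alternative sketch (assumption): do both steps per group via transform.
via_transform = df.groupby("p_id")["p_stat"].transform(
    lambda s: s.shift().cummax()
)

# Expected p_max_stat values, in the sorted row order shown in the tables.
expected = np.array([np.nan, 1, 2, 3, 4, 4, np.nan, 0, 0, 0, 0, 0], dtype=float)

matches_expected = bool(
    np.allclose(two_groupby.to_numpy(dtype=float), expected, equal_nan=True)
)
variants_agree = bool(
    np.allclose(
        two_groupby.to_numpy(dtype=float),
        via_transform.to_numpy(dtype=float),
        equal_nan=True,
    )
)
print(matches_expected, variants_agree)
```

Either variant can be assigned back with `df["p_max_stat"] = ...`, since both keep the index of `df`.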

Setup:

import pandas as pd

df = pd.DataFrame({
    'id_': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    'p_id': [1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1],
    'p_stat': [1, 0, 3, 0, 3, 0, 0, 2, 0, 4, 0, 2]
}).sort_values(["p_id", "id_"])

