How to get the cumulative maximum by group for previous rows?
I am trying to compute the p_max_stat column in this table:
+-----+------+--------+------------+
| id_ | p_id | p_stat | p_max_stat |
+-----+------+--------+------------+
| 1 | 1 | 1 | NaN |
| 2 | 1 | 2 | 1 |
| 3 | 1 | 3 | 2 |
| 4 | 1 | 4 | 3 |
| 5 | 1 | 3 | 4 |
| 6 | 1 | 2 | 4 |
| 1 | 2 | 0 | NaN |
| 2 | 2 | 0 | 0 |
| 3 | 2 | 0 | 0 |
| 4 | 2 | 0 | 0 |
| 5 | 2 | 0 | 0 |
| 6 | 2 | 0 | 0 |
+-----+------+--------+------------+
where p_max_stat is the maximum of p_stat over the previous rows, grouped by p_id.
This is what I have so far:
import numpy as np
import pandas as pd

data = {
    'id_': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 1, 7: 2, 8: 3, 9: 4, 10: 5, 11: 6},
    'p_id': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 2, 7: 1, 8: 2, 9: 1, 10: 2, 11: 1},
    'p_stat': {0: 1, 1: 0, 2: 3, 3: 0, 4: 3, 5: 0, 6: 0, 7: 2, 8: 0, 9: 4, 10: 0, 11: 2},
    # np.nan rather than np.NaN: the capitalized alias was removed in NumPy 2.0
    'p_max_stat': {0: np.nan, 1: 0.0, 2: 2.0, 3: 0.0, 4: 4.0, 5: 0.0, 6: np.nan, 7: 1.0, 8: 0.0, 9: 3.0, 10: 0.0, 11: 4.0}
}
df = pd.DataFrame(data)
df.sort_values(["p_id", "id_"], inplace=True)
df["p_max_stat_incorrect"] = (
    df
    .groupby(["p_id"])["p_stat"]
    .shift()
    .cummax()
)
This gives me the correct values for p_id == 1, but the values for p_id == 2 are incorrect:
+-----+------+--------+------------+----------------------+
| id_ | p_id | p_stat | p_max_stat | p_max_stat_incorrect |
+-----+------+--------+------------+----------------------+
| 1 | 1 | 1 | NaN | NaN |
| 2 | 1 | 2 | 1 | 1 |
| 3 | 1 | 3 | 2 | 2 |
| 4 | 1 | 4 | 3 | 3 |
| 5 | 1 | 3 | 4 | 4 |
| 6 | 1 | 2 | 4 | 4 |
| 1 | 2 | 0 | NaN | NaN |
| 2 | 2 | 0 | 0 | 4 |
| 3 | 2 | 0 | 0 | 4 |
| 4 | 2 | 0 | 0 | 4 |
| 5 | 2 | 0 | 0 | 4 |
| 6 | 2 | 0 | 0 | 4 |
+-----+------+--------+------------+----------------------+
Where did I go wrong?
groupby shift returns a Series, so the cummax call that follows is Series.cummax rather than GroupBy.cummax, and the function is applied to the whole column instead of within each group.
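A minimal sketch of that behavior, using a small made-up frame (the six rows below are illustrative, not the question's data). It checks the type returned by shift and compares the ungrouped cummax with a regrouped one:

```python
import pandas as pd

# Illustrative data: group 1 has a large value, group 2 is all zeros.
df = pd.DataFrame({
    "id_": [1, 2, 3, 1, 2, 3],
    "p_id": [1, 1, 1, 2, 2, 2],
    "p_stat": [1, 4, 2, 0, 0, 0],
})

# The shift itself respects groups, but the result is a plain Series.
shifted = df.groupby("p_id")["p_stat"].shift()
print(type(shifted).__name__)

# Series.cummax runs over the whole column, so group 1's maximum
# carries over into group 2.
leaked = shifted.cummax()

# Regrouping before cummax restarts the running maximum per group.
per_group = shifted.groupby(df["p_id"]).cummax()

print(leaked.tolist())
print(per_group.tolist())
```

The first row of each group stays NaN in both results, because cummax skips missing values by default; only the rows after it differ.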
To fix this, we can use two groupby calls:
df["p_max_stat"] = (
    df
    .groupby(["p_id"])["p_stat"]
    .shift()
    .groupby(df["p_id"])
    .cummax()
)
Or use groupby apply together with the Series methods:
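df["p_max_stat"] = (
    # group_keys=False keeps the original index so the result aligns with df;
    # on pandas >= 2.0 the default adds the group key as an extra index level
    df.groupby("p_id", group_keys=False)["p_stat"]
    .apply(lambda s: s.shift().cummax())
)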
df:
id_ p_id p_stat p_max_stat
1 1 1 NaN
2 1 2 1.0
3 1 3 2.0
4 1 4 3.0
5 1 3 4.0
6 1 2 4.0
1 2 0 NaN
2 2 0 0.0
3 2 0 0.0
4 2 0 0.0
5 2 0 0.0
6 2 0 0.0
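Another option, not part of the answer above, is transform, which always returns a result indexed like the input and so sidesteps any index-alignment concerns with apply. A sketch under the same setup as this answer:

```python
import pandas as pd

df = pd.DataFrame({
    "id_": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "p_id": [1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1],
    "p_stat": [1, 0, 3, 0, 3, 0, 0, 2, 0, 4, 0, 2],
}).sort_values(["p_id", "id_"])

# The lambda receives one group's p_stat at a time, so both shift and
# cummax operate within that group only.
df["p_max_stat"] = (
    df.groupby("p_id")["p_stat"]
    .transform(lambda s: s.shift().cummax())
)
print(df)
```

A Python lambda is called once per group, so on frames with many groups the two-groupby version is likely to be faster.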
Setup:
import pandas as pd

df = pd.DataFrame({
    'id_': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    'p_id': [1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1],
    'p_stat': [1, 0, 3, 0, 3, 0, 0, 2, 0, 4, 0, 2]
}).sort_values(["p_id", "id_"])
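To confirm that the two-groupby version reproduces the p_max_stat column from the question, the result can be compared against those expected values. This is only a sanity-check sketch; the names result and expected are introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id_": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "p_id": [1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1],
    "p_stat": [1, 0, 3, 0, 3, 0, 0, 2, 0, 4, 0, 2],
}).sort_values(["p_id", "id_"])

result = (
    df.groupby("p_id")["p_stat"]
    .shift()
    .groupby(df["p_id"])
    .cummax()
)

# Expected values copied from the question's table, in sorted row order.
expected = pd.Series(
    [np.nan, 1, 2, 3, 4, 4, np.nan, 0, 0, 0, 0, 0],
    index=df.index,
    dtype="float64",
)

# Raises AssertionError if any value or index label differs;
# NaN values in matching positions compare as equal.
pd.testing.assert_series_equal(result, expected, check_names=False)
```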