Returning most recent row with certain values in Pandas
I have a dataframe in Pandas, sorted by ID and, within each ID, by Date in descending order. It looks like
ID Date A Salary
1 2022-12-01 2 100
1 2022-11-11 3 200
1 2022-10-25 1 150
1 2022-05-17 4 160
2 2022-12-01 2 170
2 2022-11-19 1 220
2 2022-10-10 1 160
3 2022-11-11 3 350
3 2022-09-11 1 200
3 2022-08-19 3 160
3 2022-07-20 3 190
3 2022-05-11 3 200
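I want to add a new column, Salary_argmin_recent_A, that holds, for each row, the Salary of the most recent earlier row with the same ID such that A=1. The desired output looks like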
ID Date A Salary Salary_argmin_recent_A
1 2022-12-01 2 100 150 (most recent salary such that A=1 is 2022-10-25)
1 2022-11-11 3 200 150 (most recent salary such that A=1 is 2022-10-25)
1 2022-10-25 1 150 NaN (no rows before with A=1 for ID 1)
1 2022-05-17 4 160 NaN (no rows before with A=1 for ID 1)
2 2022-12-01 2 170 220
2 2022-11-19 1 220 160
2 2022-10-10 1 160 NaN
3 2022-11-11 3 350 200
3 2022-09-11 1 200 NaN
3 2022-08-19 3 160 NaN
3 2022-07-20 3 190 NaN
3 2022-05-11 3 200 NaN
Thanks in advance.
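For anyone who wants to reproduce the problem without copying the table by hand, the sample frame can be rebuilt as sketched below. The values are taken from the table in the question; parsing Date as a datetime and re-sorting are my own assumptions about how the data was prepared.

```python
import pandas as pd

# Sample data copied from the question; Date is parsed to datetime64 (an assumption).
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "Date": pd.to_datetime([
        "2022-12-01", "2022-11-11", "2022-10-25", "2022-05-17",
        "2022-12-01", "2022-11-19", "2022-10-10",
        "2022-11-11", "2022-09-11", "2022-08-19", "2022-07-20", "2022-05-11",
    ]),
    "A": [2, 3, 1, 4, 2, 1, 1, 3, 1, 3, 3, 3],
    "Salary": [100, 200, 150, 160, 170, 220, 160, 350, 200, 160, 190, 200],
})

# Enforce the ordering the question describes: by ID, then newest Date first.
df = df.sort_values(["ID", "Date"], ascending=[True, False]).reset_index(drop=True)
print(df)
```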
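# Salary where A == 1 (NaN elsewhere), back-filled within each ID so each row sees the nearest A == 1 salary at or below it
s1 = df['Salary'].where(df['A'].eq(1)).groupby(df['ID']).bfill()
# For rows that are themselves A == 1, use the next (older) A == 1 salary in the same ID instead of their own
s2 = df.groupby(['ID', 'A'])['Salary'].shift(-1)
out = df.assign(Salary_argmin_recent_A=s1.mask(df['A'].eq(1), s2))
out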
ID Date A Salary Salary_argmin_recent_A
0 1 2022-12-01 2 100 150.0
1 1 2022-11-11 3 200 150.0
2 1 2022-10-25 1 150 NaN
3 1 2022-05-17 4 160 NaN
4 2 2022-12-01 2 170 220.0
5 2 2022-11-19 1 220 160.0
6 2 2022-10-10 1 160 NaN
7 3 2022-11-11 3 350 200.0
8 3 2022-09-11 1 200 NaN
9 3 2022-08-19 3 160 NaN
10 3 2022-07-20 3 190 NaN
11 3 2022-05-11 3 200 NaN
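To see why the where/bfill/shift combination gives this result, it can help to print the intermediate series side by side. This is only a sketch on the sample data from the question; the names only_a1 and result are mine, and it assumes the frame is already sorted by ID with the newest Date first.

```python
import pandas as pd

# Sample data from the question (Date omitted; only row order matters here).
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "A": [2, 3, 1, 4, 2, 1, 1, 3, 1, 3, 3, 3],
    "Salary": [100, 200, 150, 160, 170, 220, 160, 350, 200, 160, 190, 200],
})

# Step 1: keep Salary only on rows where A == 1, NaN everywhere else.
only_a1 = df["Salary"].where(df["A"].eq(1))

# Step 2: back-fill within each ID, so every row picks up the nearest
# A == 1 salary at or below it (an A == 1 row picks up its own salary).
s1 = only_a1.groupby(df["ID"]).bfill()

# Step 3: within each (ID, A) group, shift up by one row, so an A == 1 row
# gets the salary of the next (older) A == 1 row of the same ID.
s2 = df.groupby(["ID", "A"])["Salary"].shift(-1)

# Step 4: on A == 1 rows, replace the row's own salary (from s1) with s2.
result = s1.mask(df["A"].eq(1), s2)

print(pd.DataFrame({"only_a1": only_a1, "s1": s1, "s2": s2, "result": result}))
```

The mask in step 4 is what keeps a row with A == 1 from matching itself.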
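If these are the values you are looking for, the first thing that comes to mind is iterating over the rows. Iteration should generally be avoided, and there is probably a more elegant solution, but at least it works.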
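import pandas as pd

# Copy the table from the question to the clipboard before running this.
df = pd.read_clipboard()

new_col = []
for index, row in df.iterrows():
    # All rows below the current one (relies on the default 0..n-1 index).
    df_below = df.iloc[index+1:]
    # Salaries of the rows below that have the same ID and A == 1.
    match = df_below[(df_below.A == 1) & (df_below.ID == row.ID)].Salary
    # Test for emptiness; match.any() would wrongly skip a salary of 0.
    if not match.empty:
        new_col.append(match.iloc[0])
    else:
        new_col.append(None)

df['Salary_argmin_recent_A'] = new_col
print(df)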
ID Date A Salary Salary_argmin_recent_A
0 1 2022-12-01 2 100 150.0
1 1 2022-11-11 3 200 150.0
2 1 2022-10-25 1 150 NaN
3 1 2022-05-17 4 160 NaN
4 2 2022-12-01 2 170 220.0
5 2 2022-11-19 1 220 160.0
6 2 2022-10-10 1 160 NaN
7 3 2022-11-11 3 350 200.0
8 3 2022-09-11 1 200 NaN
9 3 2022-08-19 3 160 NaN
10 3 2022-07-20 3 190 NaN
11 3 2022-05-11 3 200 NaN
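Here I iterate over all rows. In each iteration I take the index value and store a slice of the DataFrame in df_below, which represents all rows below the current row.
I then filter those rows by A and ID; the result may contain zero, one or several values. So I check whether there are any results and, if so, I take the first value, which is the most recent one.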