[英]Returning most recent row with certain values in Pandas
I have a dataframe sorted by ID
and in descending order of Date in Pandas that looks like我有一个 dataframe 按ID
排序并按日期降序排列在 Pandas 中,看起来像
ID Date A Salary
1 2022-12-01 2 100
1 2022-11-11 3 200
1 2022-10-25 1 150
1 2022-05-17 4 160
2 2022-12-01 2 170
2 2022-11-19 1 220
2 2022-10-10 1 160
3 2022-11-11 3 350
3 2022-09-11 1 200
3 2022-08-19 3 160
3 2022-07-20 3 190
3 2022-05-11 3 200
I would like to add a new column Salary_argmin_recent_A
that outputs the most recent Salary row of a specific ID such that A=1, so the desired output looks like我想添加一个新列Salary_argmin_recent_A
输出特定 ID 的最新 Salary 行,使得 A=1,因此所需的 output 看起来像
ID Date A Salary Salary_argmin_recent_A
1 2022-12-01 2 100 150 (most recent salary such that A=1 is 2022-10-25)
1 2022-11-11 3 200 150 (most recent salary such that A=1 is 2022-10-25)
1 2022-10-25 1 150 NaN (no rows before with A=1 for ID 1)
1 2022-05-17 4 160 NaN (no rows before with A=1 for ID 1)
2 2022-12-01 2 170 220
2 2022-11-19 1 220 160
2 2022-10-10 1 160 NaN
3 2022-11-11 3 350 200
3 2022-09-11 1 200 NaN
3 2022-08-19 3 160 NaN
3 2022-07-20 3 190 NaN
3 2022-05-11 3 200 NaN
Thanks in advance.提前致谢。
s1 = df['Salary'].where(df['A'].eq(1)).groupby(df['ID']).bfill()
s2 = df.groupby(['ID', 'A'])['Salary'].shift(-1)
out = df.assign(Salary_argmin_recent_A=s1.mask(df['A'].eq(1), s2))
out
ID Date A Salary Salary_argmin_recent_A
0 1 2022-12-01 2 100 150.0
1 1 2022-11-11 3 200 150.0
2 1 2022-10-25 1 150 NaN
3 1 2022-05-17 4 160 NaN
4 2 2022-12-01 2 170 220.0
5 2 2022-11-19 1 220 160.0
6 2 2022-10-10 1 160 NaN
7 3 2022-11-11 3 350 200.0
8 3 2022-09-11 1 200 NaN
9 3 2022-08-19 3 160 NaN
10 3 2022-07-20 3 190 NaN
11 3 2022-05-11 3 200 NaN
Given you're looking for the values below, the first thing that comes to mind is to iterate through rows.如果您要查找以下值,首先想到的是遍历行。 Usually iteration should be escaped and there might be a more elegant solution to this, but at least it works.通常迭代应该被转义并且可能有更优雅的解决方案,但至少它有效。
import pandas as pd
df = pd.read_clipboard()
new_col = []
for index, row in df.iterrows():
df_below = df.iloc[index+1:]
match = df_below[(df_below.A == 1) & (df_below.ID == row.ID)].Salary
if match.any():
new_col.append(match.iloc[0])
else:
new_col.append(None)
df['Salary_argmin_recent_A'] = new_col
print(df)
ID Date A Salary Salary_argmin_recent_A
0 1 2022-12-01 2 100 150.0
1 1 2022-11-11 3 200 150.0
2 1 2022-10-25 1 150 NaN
3 1 2022-05-17 4 160 NaN
4 2 2022-12-01 2 170 220.0
5 2 2022-11-19 1 220 160.0
6 2 2022-10-10 1 160 NaN
7 3 2022-11-11 3 350 200.0
8 3 2022-09-11 1 200 NaN
9 3 2022-08-19 3 160 NaN
10 3 2022-07-20 3 190 NaN
11 3 2022-05-11 3 200 NaN
Here I iterate through all the rows, on each iteration I take the index value and store a cut of DataFrame in df_below
which represents all the rows below current row.在这里,我遍历所有行,在每次迭代中,我获取索引值并将 DataFrame 的一个片段存储在df_below
中,它表示当前行下方的所有行。
I than filter values by A
and ID
, results may contain nothing, one or more values.我不是通过A
和ID
过滤值,结果可能不包含一个或多个值。 So I check if there are some results and if true, I take the first value, the most recent one.所以我检查是否有一些结果,如果是真的,我取第一个值,最近的一个。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.