
Pandas keep the latest rows for the same ID with some conditional column values

I want to keep the latest row for each ID, plus any rows that match certain column values. Sample input:

ID          Timestamp       Survey Outcome
12          11/26/2021      INCOMPLETE Survey
95          11/26/2021      INCOMPLETE Survey
95          11/27/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
12          11/29/2021      COMPLETE Survey
24          11/26/2021      RANG-But did not connect
24          11/27/2021      INCOMPLETE Survey
95          11/28/2021      RANG-But did not connect
24          11/28/2021      INCOMPLETE Survey
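
For anyone who wants to reproduce this, here is a minimal sketch of loading the sample above into a DataFrame. It assumes the columns are separated by two or more spaces (so that values such as "INCOMPLETE Survey" stay in one field) and that the dates are month/day/year; the variable names `raw` and `df` are only illustrative.

```python
import io

import pandas as pd

raw = """\
ID          Timestamp       Survey Outcome
12          11/26/2021      INCOMPLETE Survey
95          11/26/2021      INCOMPLETE Survey
95          11/27/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
12          11/29/2021      COMPLETE Survey
24          11/26/2021      RANG-But did not connect
24          11/27/2021      INCOMPLETE Survey
95          11/28/2021      RANG-But did not connect
24          11/28/2021      INCOMPLETE Survey
"""

# Assumption: fields are separated by runs of 2+ whitespace characters,
# which keeps the single spaces inside "Survey Outcome" values intact.
df = pd.read_csv(io.StringIO(raw), sep=r"\s{2,}", engine="python")

# Assumption: dates are month/day/year.
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%m/%d/%Y")
```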

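Here ID 12 has two rows, so I will keep only the latest one (11/29/2021). For ID 95, however, once the survey is complete it should not receive any other outcome, such as RANG-But did not connect. So I want to keep the latest row for each ID and, for any ID whose survey was completed but whose later rows show an incomplete or not-connected outcome, also keep every row from COMPLETE Survey onward.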

So my sample output would be:

ID          Timestamp       Survey Outcome
95          11/27/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
12          11/29/2021      COMPLETE Survey
95          11/28/2021      RANG-But did not connect
24          11/28/2021      INCOMPLETE Survey


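First sort by ID and Timestamp with DataFrame.sort_values, then use GroupBy.cummax to select every row from COMPLETE Survey onward within each ID, and finally add the last row of each ID that was not matched, using DataFrame.drop_duplicates together with isin: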

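import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.sort_values(['ID','Timestamp'])

# True on COMPLETE Survey rows
m = df['Survey Outcome'].eq('COMPLETE Survey')

# rows from the first COMPLETE Survey onward within each ID
df1 = df[m.groupby(df['ID']).cummax()]
# latest row per ID
df2 = df.drop_duplicates('ID', keep='last')

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
df = pd.concat([df1, df2[~df2['ID'].isin(df1['ID'])]]).sort_index()

print(df)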
   ID  Timestamp            Survey Outcome
2  95 2021-11-27           COMPLETE Survey
3  95 2021-11-28  RANG-But did not connect
4  12 2021-11-29           COMPLETE Survey
7  95 2021-11-28  RANG-But did not connect
8  24 2021-11-28         INCOMPLETE Survey
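
As a variation on the same idea, the two selections can be folded into one boolean mask, which avoids concatenating two frames. This is only a sketch under the same assumptions as above (month/day/year dates, the sample data from the question); the names `from_complete`, `has_complete`, `is_last` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 95, 95, 95, 12, 24, 24, 95, 24],
    "Timestamp": ["11/26/2021", "11/26/2021", "11/27/2021", "11/28/2021",
                  "11/29/2021", "11/26/2021", "11/27/2021", "11/28/2021",
                  "11/28/2021"],
    "Survey Outcome": ["INCOMPLETE Survey", "INCOMPLETE Survey",
                       "COMPLETE Survey", "RANG-But did not connect",
                       "COMPLETE Survey", "RANG-But did not connect",
                       "INCOMPLETE Survey", "RANG-But did not connect",
                       "INCOMPLETE Survey"],
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%m/%d/%Y")

# Work on a sorted copy so "first COMPLETE" and "last row" are chronological.
s = df.sort_values(["ID", "Timestamp"])

is_complete = s["Survey Outcome"].eq("COMPLETE Survey")
by_id = is_complete.groupby(s["ID"])

# True from the first COMPLETE Survey row onward within each ID.
from_complete = by_id.cummax()
# True on every row of an ID that has at least one COMPLETE Survey row.
has_complete = by_id.transform("any")
# True on the latest row of each ID.
is_last = ~s["ID"].duplicated(keep="last")

# Keep the post-COMPLETE rows, or the latest row for IDs never completed.
result = s[from_complete | (~has_complete & is_last)].sort_index()
print(result)
```

On the sample data this selects the rows with original index 2, 3, 4, 7 and 8, the same rows as the output shown above.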

You can use:

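import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# sort_values returns a new DataFrame, so the result must be assigned back
df = df.sort_values(by=['ID', 'Timestamp']).reset_index(drop=True)
# per ID: keep everything from the first COMPLETE Survey row onward, otherwise only the latest row
df = df.groupby('ID').apply(lambda x: x.loc[x[x['Survey Outcome'] == 'COMPLETE Survey'].index[0]: ] if
                            x['Survey Outcome'].isin(['COMPLETE Survey']).any() else x.loc[x['Timestamp'].idxmax():]).reset_index(drop=True)
print(df)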

OUTPUT

   ID  Timestamp            Survey Outcome
0  12 2021-11-29           COMPLETE Survey
1  24 2021-11-28         INCOMPLETE Survey
2  95 2021-11-27           COMPLETE Survey
3  95 2021-11-28  RANG-But did not connect
4  95 2021-11-28  RANG-But did not connect

