Remove duplicates from a Pandas dataframe based on a condition in another column

Question

I need to remove duplicate rows with the same p_id from the following Pandas dataframe, subject to these conditions:

The highest keep priority should be given to the row that contains a timestamp.
If multiple rows with timestamps are present, the keep priority should be given to the latest one.
If none of the repeated rows contains a timestamp, keep them all as is.


p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P1      M       19
P3              56      
P3      F       34      2021-01-25 14:06:00

The expected output

p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P3      F       34      2021-01-25 14:06:00

Answer 1

One possibility is to first identify the ids whose timestamps are all null, and then concatenate those rows with the result of .drop_duplicates applied to the rows that do have a timestamp.

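import pandas as pd

# Parse timestamps as datetimes, then sort so the latest timestamp comes first within each p_id
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['p_id','timestamp'], ascending=[True,False])

# Rows whose p_id has no timestamp at all are kept as they are
mask = df.groupby('p_id')['timestamp'].transform('count') == 0
all_nans = df[mask]

# Among rows with a timestamp, keep only the latest one per p_id
valid_dates = df[df['timestamp'].notna()].drop_duplicates('p_id', keep = 'first')

# Timestamped rows first, then the untouched groups, matching the row order shown below
pd.concat([valid_dates, all_nans])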
#output:

    p_id    sex age     timestamp
0   P1      M   23.0    2021-01-25 13:53:30
5   P3      F   34.0    2021-01-25 14:06:00
1   P4      M   NaN     NaT
2   P4      F   45.0    NaT
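
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same steps. The function name dedupe_by_timestamp and its key/ts parameters are made up for illustration, and writing the blank cells as None is an assumption about how the original frame was loaded. The final sort_index() is an addition to the answer above; it puts the rows back in their original order, which matches the expected output in the question.

```python
import pandas as pd

# Sample data from the question; blank cells are written as None (an assumption).
df = pd.DataFrame({
    "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
    "sex": ["M", "M", "F", "M", None, "F"],
    "age": [23, None, 45, 19, 56, 34],
    "timestamp": ["2021-01-25 13:53:30", None, None, None, None,
                  "2021-01-25 14:06:00"],
})


def dedupe_by_timestamp(frame, key="p_id", ts="timestamp"):
    """Hypothetical helper: keep the latest timestamped row per key,
    or every row when a key has no timestamps at all."""
    out = frame.copy()
    out[ts] = pd.to_datetime(out[ts])

    # Rows whose key has no timestamp anywhere are kept untouched.
    no_ts_in_group = out.groupby(key)[ts].transform("count") == 0
    all_nans = out[no_ts_in_group]

    # Among timestamped rows, keep only the most recent one per key.
    latest = (
        out[out[ts].notna()]
        .sort_values(ts, ascending=False)
        .drop_duplicates(key, keep="first")
    )

    # sort_index() restores the original row order.
    return pd.concat([latest, all_nans]).sort_index()


result = dedupe_by_timestamp(df)
print(result)
```

Note that if two rows of the same p_id share the exact same latest timestamp, drop_duplicates keeps only one of them; whether that is the desired behaviour is not specified in the question.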

Solution 1
0 Accepted 2021-03-02 00:07:20
