
Removing duplicates from a Pandas dataframe based on the conditions of another column

I need to remove duplicate rows with the same p_id from the following Pandas dataframe, subject to these conditions:

  1. The highest keep priority should be given to the row that contains a timestamp
  2. If multiple rows with the same p_id have timestamps, keep only the one with the latest timestamp
  3. If none of the repeated rows for a p_id contains a timestamp, keep them all as is

p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P1      M       19
P3              56      
P3      F       34      2021-01-25 14:06:00 
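
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: it assumes the empty cells are missing values (NaN/None) rather than empty strings, and that the frame has a default integer index.

```python
import numpy as np
import pandas as pd

# Sample data from the question; empty cells are assumed to be missing values.
df = pd.DataFrame(
    {
        "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
        "sex": ["M", "M", "F", "M", None, "F"],
        "age": [23, np.nan, 45, 19, 56, 34],
        "timestamp": [
            "2021-01-25 13:53:30", None, None, None, None, "2021-01-25 14:06:00",
        ],
    }
)

# Convert the strings to datetimes; missing entries become NaT.
df["timestamp"] = pd.to_datetime(df["timestamp"])
```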

The expected output:

p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P3      F       34      2021-01-25 14:06:00 

One possibility is to first identify the p_ids whose timestamps are all null, keep those rows untouched, and concatenate them with the result of a .drop_duplicates on the rows that do have a timestamp:

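import pandas as pd

# Parse timestamps (missing values become NaT), then sort so that the latest
# timestamp comes first within each p_id; NaT rows sort last.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['p_id', 'timestamp'], ascending=[True, False])

# Rows of p_ids that have no timestamp at all: keep these untouched.
mask = df.groupby('p_id')['timestamp'].transform('count') == 0
all_nans = df[mask]

# Among rows that have a timestamp, keep only the latest one per p_id.
valid_dates = df[df['timestamp'].notna()].drop_duplicates('p_id', keep='first')

pd.concat([valid_dates, all_nans])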
#output:

    p_id    sex age     timestamp
0   P1      M   23.0    2021-01-25 13:53:30
5   P3      F   34.0    2021-01-25 14:06:00
1   P4      M   NaN     NaT
2   P4      F   45.0    NaT
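
If the original row order matters (as in the expected output in the question), the same idea can be wrapped in a small helper that sorts the result by index at the end. The function name `drop_duplicates_prefer_timestamp` and its parameters are made up for this illustration, and the sketch assumes the index reflects the original row order and that empty cells are missing values.

```python
import numpy as np
import pandas as pd


def drop_duplicates_prefer_timestamp(df, key="p_id", ts="timestamp"):
    """Hypothetical helper: per key, keep the row with the latest timestamp,
    or keep every row if the key has no timestamps at all."""
    out = df.copy()
    out[ts] = pd.to_datetime(out[ts])

    # True for every row whose key group contains no timestamp at all.
    no_ts_group = out.groupby(key)[ts].transform("count") == 0

    # Among rows with a timestamp, keep only the latest one per key.
    latest = (
        out[out[ts].notna()]
        .sort_values(ts, ascending=False)
        .drop_duplicates(key, keep="first")
    )

    # Recombine and restore the original row order via the index.
    return pd.concat([out[no_ts_group], latest]).sort_index()


# Sample data from the question (empty cells assumed to be missing values).
sample = pd.DataFrame(
    {
        "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
        "sex": ["M", "M", "F", "M", None, "F"],
        "age": [23, np.nan, 45, 19, 56, 34],
        "timestamp": [
            "2021-01-25 13:53:30", None, None, None, None, "2021-01-25 14:06:00",
        ],
    }
)

result = drop_duplicates_prefer_timestamp(sample)
print(result)
```

Sorting by index only restores the original order when the index is the default RangeIndex or otherwise monotonic; with a custom index, a temporary position column would be needed instead.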
