Removing duplicates from a Pandas dataframe based on the conditions of another column

Question

I need to remove duplicate rows with the same p_id from the following Pandas dataframe, subject to these conditions:

The highest keep priority should be given to the row that contains a timestamp
If multiple rows with timestamps are present, the keep priority should be given to the latest one
If none of the repeated instances contain a timestamp, keep them all as is


p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P1      M       19
P3              56      
P3      F       34      2021-01-25 14:06:00

The expected output

p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P3      F       34      2021-01-25 14:06:00

Answer 1

One possibility is to first identify the ids whose timestamps are all null, then concatenate those rows with the result of a .drop_duplicates on the rows that do have a timestamp

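import pandas as pd

# Parse timestamps as datetimes; missing values become NaT
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Within each p_id, put the latest timestamp first
df = df.sort_values(['p_id','timestamp'], ascending=[True,False])

# Rows whose p_id has no timestamp at all: keep every one of them
mask = df.groupby('p_id')['timestamp'].transform('count') == 0
all_nans = df[mask]

# Among rows that have a timestamp, keep only the latest one per p_id
valid_dates = df[df['timestamp'].notna()].drop_duplicates('p_id', keep = 'first')

# valid_dates goes first so the row order matches the output shown below
pd.concat([valid_dates, all_nans])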
#output:

    p_id    sex age     timestamp
0   P1      M   23.0    2021-01-25 13:53:30
5   P3      F   34.0    2021-01-25 14:06:00
1   P4      M   NaN     NaT
2   P4      F   45.0    NaT
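
As a variation on the approach above (not the answerer's code), the same rules can be sketched with a single boolean mask, which leaves the rows in their original order as in the expected output of the question. The sample frame is rebuilt here so the snippet runs on its own; treating the blank cells as missing values is an assumption, and the names latest, keep and result are only illustrative.

```python
import pandas as pd

# Sample data from the question; blank cells are assumed to be missing values.
df = pd.DataFrame({
    "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
    "sex": ["M", "M", "F", "M", None, "F"],
    "age": [23, None, 45, 19, 56, 34],
    "timestamp": ["2021-01-25 13:53:30", None, None, None, None,
                  "2021-01-25 14:06:00"],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Latest timestamp per p_id, broadcast to every row of the group
# (NaT when the group has no timestamp at all).
latest = df.groupby("p_id")["timestamp"].transform("max")

# Keep every row of a group without timestamps, plus the rows that
# carry the latest timestamp of their group.
keep = latest.isna() | (df["timestamp"] == latest)
result = df[keep]
print(result)
```

One caveat with this sketch: if two rows of the same p_id share the exact same latest timestamp, both are kept, whereas the drop_duplicates version keeps only one of them.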

solution1
0 ACCEPTED 2021-03-02 00:07:20
