I need to remove duplicate rows that share the same p_id
from the following pandas DataFrame, but using these conditions:
p_id  sex  age  timestamp
P1    M    23   2021-01-25 13:53:30
P4    M
P4    F    45
P1    M    19
P3         56
P3    F    34   2021-01-25 14:06:00
The expected output:
p_id  sex  age  timestamp
P1    M    23   2021-01-25 13:53:30
P4    M
P4    F    45
P3    F    34   2021-01-25 14:06:00
One possibility is to first identify the ids whose timestamps are all null, then concatenate those rows with the result of a .drop_duplicates on the rows that do have a timestamp:
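import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
# Sort so that the latest timestamp of each p_id comes first (NaT sorts last)
df = df.sort_values(['p_id', 'timestamp'], ascending=[True, False])
# Rows of ids that have no timestamp at all: keep every one of them
mask = df.groupby('p_id')['timestamp'].transform('count') == 0
all_nans = df[mask]
# For ids with at least one timestamp, keep only the latest dated row
valid_dates = df[df['timestamp'].notna()].drop_duplicates('p_id', keep='first')
pd.concat([valid_dates, all_nans])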
# Output:
  p_id sex   age           timestamp
0   P1   M  23.0 2021-01-25 13:53:30
5   P3   F  34.0 2021-01-25 14:06:00
1   P4   M   NaN                 NaT
2   P4   F  45.0                 NaT
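For anyone who wants to reproduce this, here is a self-contained sketch of a variant of the same idea. It uses a single boolean mask instead of sort plus concat, so the original row order is preserved. The sample frame is rebuilt from the question on the assumption that the blank cells are NaN/NaT, and the function name dedupe_by_timestamp is only an illustrative name, not something from the original answer. Note that if two rows of the same id share the latest timestamp, this variant keeps both.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; blank cells are assumed to be NaN/NaT.
df = pd.DataFrame({
    "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
    "sex": ["M", "M", "F", "M", np.nan, "F"],
    "age": [23, np.nan, 45, 19, 56, 34],
    "timestamp": pd.to_datetime([
        "2021-01-25 13:53:30", None, None, None, None, "2021-01-25 14:06:00",
    ]),
})


def dedupe_by_timestamp(frame):
    """Hypothetical helper: keep the latest dated row per p_id,
    or every row of a p_id that has no timestamp at all."""
    per_id = frame.groupby("p_id")["timestamp"]
    # True for every row of an id whose timestamps are all null
    all_null = per_id.transform("count") == 0
    # True for the row(s) holding the latest timestamp of each id;
    # NaT never compares equal, so undated rows are False here
    latest = frame["timestamp"] == per_id.transform("max")
    return frame[all_null | latest]


result = dedupe_by_timestamp(df)
print(result)
```

Because nothing is sorted, the rows come back in the same order as in the expected output of the question (P1, P4, P4, P3).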