Removing duplicates from a Pandas dataframe based on the conditions of another column
I need to remove duplicate rows with the same p_id from the following Pandas dataframe, but using these conditions:
p_id  sex  age  timestamp
P1    M    23   2021-01-25 13:53:30
P4    M
P4    F    45
P1    M    19
P3         56
P3    F    34   2021-01-25 14:06:00
The expected output:
p_id  sex  age  timestamp
P1    M    23   2021-01-25 13:53:30
P4    M
P4    F    45
P3    F    34   2021-01-25 14:06:00
One possibility is to first identify the ids whose timestamps are all null, then concatenate those rows with the result of a .drop_duplicates on the rows that do have a timestamp.
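import pandas as pd

# Parse timestamps; missing values become NaT
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Newest timestamp first within each p_id (NaT sorts last)
df = df.sort_values(['p_id', 'timestamp'], ascending=[True, False])
# count() skips NaT, so 0 marks ids that have no timestamp at all
mask = df.groupby('p_id')['timestamp'].transform('count') == 0
all_nans = df[mask]
valid_dates = df[df['timestamp'].notna()].drop_duplicates('p_id', keep='first')
# Dated rows first, which is the row order shown in the output
pd.concat([valid_dates, all_nans])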
#output:
  p_id sex   age           timestamp
0   P1   M  23.0 2021-01-25 13:53:30
5   P3   F  34.0 2021-01-25 14:06:00
1   P4   M   NaN                 NaT
2   P4   F  45.0                 NaT
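For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample data from the question, assuming the blank cells are missing values (None) rather than empty strings. The function name dedupe_by_timestamp is hypothetical, and the final sort_index() is an addition that restores the original row order so the result lines up with the expected output.

```python
import pandas as pd

# Sample data rebuilt from the question; blank cells are assumed to be missing (None).
df = pd.DataFrame({
    "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
    "sex": ["M", "M", "F", "M", None, "F"],
    "age": [23, None, 45, 19, 56, 34],
    "timestamp": ["2021-01-25 13:53:30", None, None, None, None,
                  "2021-01-25 14:06:00"],
})


def dedupe_by_timestamp(frame):
    """Hypothetical helper: keep the newest dated row per p_id, but keep
    every row of a p_id that has no timestamp at all."""
    out = frame.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    # Newest timestamp first within each p_id; NaT sorts last by default.
    out = out.sort_values(["p_id", "timestamp"], ascending=[True, False])
    # count() ignores NaT, so 0 means the whole group lacks a timestamp.
    no_dates = out.groupby("p_id")["timestamp"].transform("count") == 0
    dated = out[out["timestamp"].notna()].drop_duplicates("p_id", keep="first")
    # sort_index() puts the surviving rows back in their original order.
    return pd.concat([out[no_dates], dated]).sort_index()


result = dedupe_by_timestamp(df)
print(result)
```

With this sample the surviving rows are those at index 0, 1, 2 and 5, that is P1, P4, P4, P3. Working on a copy leaves the caller's dataframe untouched, which is a design choice of this sketch and not part of the answer above.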