
Removing duplicates from a Pandas dataframe based on the conditions of another column

I need to remove duplicate rows that share the same p_id from the following Pandas dataframe, subject to these conditions:

  1. The highest keep priority goes to a row that contains a timestamp.
  2. If several rows have timestamps, keep the latest one.
  3. If none of the repeated rows contains a timestamp, keep them all as they are.

p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P1      M       19
P3              56      
P3      F       34      2021-01-25 14:06:00 
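
To reproduce the frame above, a minimal sketch follows. It assumes the blank cells are missing values (NaN/None) and that the timestamps are parsed with pd.to_datetime; the original post does not show how the frame was built.

```python
import numpy as np
import pandas as pd

# Sample data from the table above; blank cells are assumed to be missing values
df = pd.DataFrame({
    "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
    "sex": ["M", "M", "F", "M", np.nan, "F"],
    "age": [23, np.nan, 45, 19, 56, 34],
    "timestamp": ["2021-01-25 13:53:30", None, None, None, None,
                  "2021-01-25 14:06:00"],
})

# Parse the strings into datetimes; missing entries become NaT
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df)
```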

The expected output:

p_id    sex     age     timestamp
P1      M       23      2021-01-25 13:53:30
P4      M
P4      F       45
P3      F       34      2021-01-25 14:06:00 

One possibility is to first identify the ids whose timestamps are all null, then concatenate those rows with the result of a .drop_duplicates call on the rows that do have a timestamp:

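import pandas as pd

# Parse timestamps; missing values become NaT
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Within each p_id, put the latest timestamp first (NaT sorts last)
df = df.sort_values(['p_id', 'timestamp'], ascending=[True, False])

# Rows whose p_id group has no timestamp at all: keep them all
mask = df.groupby('p_id')['timestamp'].transform('count') == 0
all_nans = df[mask]

# Among rows with a timestamp, keep only the latest per p_id
valid_dates = df[df['timestamp'].notna()].drop_duplicates('p_id', keep='first')

# valid_dates first, so the row order matches the output below
pd.concat([valid_dates, all_nans])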
#output:

    p_id    sex age     timestamp
0   P1      M   23.0    2021-01-25 13:53:30
5   P3      F   34.0    2021-01-25 14:06:00
1   P4      M   NaN     NaT
2   P4      F   45.0    NaT
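
The same idea can be wrapped into a self-contained function. This is only a sketch: the name dedupe_by_timestamp and its key/ts parameters are hypothetical, and the final sort_index() is an addition not present in the answer above, used here to restore the original row order so the result lines up with the expected output in the question.

```python
import numpy as np
import pandas as pd


def dedupe_by_timestamp(df, key="p_id", ts="timestamp"):
    """Hypothetical helper: per `key`, keep the latest timestamped row,
    or every row when the whole group has no timestamp."""
    out = df.copy()
    out[ts] = pd.to_datetime(out[ts])

    # Number of non-null timestamps in the group each row belongs to
    n_ts = out.groupby(key)[ts].transform("count")

    # Rule 3: groups without any timestamp are kept untouched
    all_null = out[n_ts == 0]

    # Rules 1 and 2: among timestamped rows, keep the latest per key
    latest = (
        out[out[ts].notna()]
        .sort_values(ts, ascending=False)
        .drop_duplicates(key, keep="first")
    )

    # Assumption: restoring the original row order is desirable
    return pd.concat([latest, all_null]).sort_index()


# Sample data from the question; blank cells are assumed to be missing
sample = pd.DataFrame({
    "p_id": ["P1", "P4", "P4", "P1", "P3", "P3"],
    "sex": ["M", "M", "F", "M", np.nan, "F"],
    "age": [23, np.nan, 45, 19, 56, 34],
    "timestamp": ["2021-01-25 13:53:30", None, None, None, None,
                  "2021-01-25 14:06:00"],
})

result = dedupe_by_timestamp(sample)
print(result)
```

With the sample data this keeps the rows at index 0, 1, 2 and 5, that is P1, both P4 rows, and the timestamped P3 row. If two rows of the same p_id share the latest timestamp, only one of them is kept.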

