Pandas: Drop duplicates in col[A] keeping row based on condition on col[B]
Given the DataFrame:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B'], 'col2': ['type1', 'type2', 'type1', 'type2', 'type1'], 'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31']})
col1 col2 hour
A type1 18:03:30 # Drop this row: (A, type1) appears again with a later hour
A type2 18:00:48
A type1 18:13:46 # Keep this row: it has the latest hour for (A, type1)
B type2 18:11:29
B type1 18:06:31
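I want to drop duplicates based on col1 and col2,
for example row 0 (A, type1) and row 2 (A, type1),
keeping only the row with the latest hour, here 18:13:46.
I tried using groupby to get the subsets for each col1 value, then drop_duplicates to remove the duplicates in col2. I need a way to pass a condition (the latest hour).
Sample code: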
for key, grp in df.groupby('col1'):
    # Pseudocode: "LATEST OF HOUR" is not a valid value; keep only accepts 'first', 'last' or False
    grp.drop_duplicates(subset='col2', keep="LATEST OF HOUR")
Expected result:
col1 col2 hour
A type1 18:13:46
A type2 18:00:48
B type2 18:11:29
B type1 18:06:31
My original DataFrame is larger, and the solution needs to work on it as well:
col1 col2 other hour
A type1 h 18:03:30 # Drop this row: (A, type1) appears again with a later hour
A type2 ss 18:00:48
A type1 ll 18:13:46 # Keep this row: it has the latest hour for (A, type1)
B type2 mm 18:11:29
B type1 jj 18:06:31
The duplicate rows still need to be dropped based on hour:
df.drop_duplicates(['col1','col2'] , keep = 'last')
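Note that keep='last' on its own keeps the last occurrence by row position, not by hour. It gives the wanted rows for the sample data only because the later (A, type1) row happens to come second. A minimal sketch of the difference, using a hypothetical two-row frame (not from the question) in which the later hour comes first:

```python
import pandas as pd

# Hypothetical frame: the (A, type1) row with the later hour appears first.
df = pd.DataFrame({
    'col1': ['A', 'A'],
    'col2': ['type1', 'type1'],
    'hour': ['18:13:46', '18:03:30'],
})

# keep='last' picks the last row by position, which here has the earlier hour.
by_position = df.drop_duplicates(['col1', 'col2'], keep='last')

# Sorting by hour first makes "last by position" coincide with "latest hour".
by_hour = df.sort_values('hour').drop_duplicates(['col1', 'col2'], keep='last')

print(by_position)
print(by_hour)
```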
Following anky_91's comment, I solved it like this:
df.sort_values('hour').drop_duplicates(['col1','col2'] , keep = 'last')
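The sort is done on the 'hour' column, so keep='last' is guaranteed to pick the row with the latest hour in each (col1, col2) group. This works on the string values because they are zero-padded HH:MM:SS, so lexicographic order matches chronological order.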
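As a runnable sketch, here is the same approach applied to the larger frame from the question, with the extra 'other' column. The trailing sort_index() is an addition not in the original answer; it only restores the original row order after the sort and can be dropped if the order does not matter.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': ['type1', 'type2', 'type1', 'type2', 'type1'],
    'other': ['h', 'ss', 'll', 'mm', 'jj'],
    'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31'],
})

result = (
    df.sort_values('hour')                            # earliest hour first
      .drop_duplicates(['col1', 'col2'], keep='last')  # keep latest hour per (col1, col2)
      .sort_index()                                    # optional: restore original row order
)
print(result)
```

Columns that are not part of the subset, such as 'other', are carried along unchanged with whichever row is kept.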