Drop duplicates on one column, breaking ties from another column
I have the following dataframe:
import pandas as pd

x = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37", "2020-06-07 11:09:18", "2020-06-07 11:04:40", "2020-06-07 11:09:11", "2020-06-07 11:09:23"]
})
item vote timestamp
a 1 2020-06-07 11:04:26
a 0 2020-06-07 11:03:37
a 1 2020-06-07 11:09:18
b 1 2020-06-07 11:04:40
c 0 2020-06-07 11:09:11
c 0 2020-06-07 11:09:23
How can I drop_duplicates on the item column, using the timestamp column as the tie-breaker so that the most recent row is kept? The final dataframe should look like this:
item vote timestamp
a 1 2020-06-07 11:09:18
b 1 2020-06-07 11:04:40
c 0 2020-06-07 11:09:23
You can call sort_values on "item" and "timestamp" before dropping the duplicates:
x.sort_values(['item', 'timestamp']).drop_duplicates('item', keep='last')
item vote timestamp
2 a 1 2020-06-07 11:09:18
3 b 1 2020-06-07 11:04:40
5 c 0 2020-06-07 11:09:23
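Specifying keep='last' means that, for each item, every row except the last one is dropped. Because the previous step sorted by timestamp, that last row is the most recent one. To also renumber the index from 0, chain reset_index(drop=True):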
(x.sort_values(['item', 'timestamp'])
.drop_duplicates('item', keep='last')
.reset_index(drop=True))
item vote timestamp
0 a 1 2020-06-07 11:09:18
1 b 1 2020-06-07 11:04:40
2 c 0 2020-06-07 11:09:23
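One caveat: the timestamps above are strings, so sort_values compares them lexicographically. That gives the right order here only because every value uses the same zero-padded "YYYY-MM-DD HH:MM:SS" layout. A minimal sketch of a more defensive variant, assuming you are free to convert the column with pd.to_datetime first (the names latest and via_idxmax are just illustrative); it also cross-checks the result against a groupby/idxmax lookup, which is an alternative technique and not part of the answer above:

```python
import pandas as pd

x = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37",
                  "2020-06-07 11:09:18", "2020-06-07 11:04:40",
                  "2020-06-07 11:09:11", "2020-06-07 11:09:23"],
})

# Parse the strings so that sorting compares real datetimes, not text.
x["timestamp"] = pd.to_datetime(x["timestamp"])

# Same approach as above: sort, then keep the last (latest) row per item.
latest = (
    x.sort_values(["item", "timestamp"])
     .drop_duplicates("item", keep="last")
     .reset_index(drop=True)
)

# Alternative for comparison: pick the row label of the max timestamp per item.
via_idxmax = x.loc[x.groupby("item")["timestamp"].idxmax()].reset_index(drop=True)

print(latest)
```

Both routes select the same rows for this data. The sort-based version extends more naturally if you later need a secondary tie-breaker column, since you can simply add it to the sort_values list.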