Drop duplicates on one column, breaking ties from another column

I have the following dataframe:

import pandas as pd

x = pd.DataFrame({
    "item" : ["a", "a", "a", "b", "c", "c"],
    "vote" : [1, 0, 1, 1, 0, 0],
    "timestamp" : ["2020-06-07 11:04:26", "2020-06-07 11:03:37", "2020-06-07 11:09:18", "2020-06-07 11:04:40", "2020-06-07 11:09:11", "2020-06-07 11:09:23"]
})

item   vote   timestamp
a      1      2020-06-07 11:04:26
a      0      2020-06-07 11:03:37
a      1      2020-06-07 11:09:18
b      1      2020-06-07 11:04:40      
c      0      2020-06-07 11:09:11
c      0      2020-06-07 11:09:23

How can I drop_duplicates on the item column, using the timestamp column as a tiebreaker so that the latest row is kept? The final dataframe should look like this:

item   vote   timestamp
a      1      2020-06-07 11:09:18
b      1      2020-06-07 11:04:40      
c      0      2020-06-07 11:09:23

You can call sort_values on "item" and "timestamp" before dropping duplicates:

x.sort_values(['item', 'timestamp']).drop_duplicates('item', keep='last')

  item  vote            timestamp
2    a     1  2020-06-07 11:09:18
3    b     1  2020-06-07 11:04:40
5    c     0  2020-06-07 11:09:23

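Specifying keep='last' means that, for each item, every row except the last one is dropped. That last row is the latest one because the previous step sorted by timestamp within each item.

To also reset the index, chain reset_index(drop=True):
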
(x.sort_values(['item', 'timestamp'])
  .drop_duplicates('item', keep='last')
  .reset_index(drop=True))

  item  vote            timestamp
0    a     1  2020-06-07 11:09:18
1    b     1  2020-06-07 11:04:40
2    c     0  2020-06-07 11:09:23
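
Note that the sort above compares the timestamps as strings. That happens to be safe here because the "YYYY-MM-DD HH:MM:SS" format sorts lexicographically in chronological order. As a minimal sketch, assuming you would rather not depend on the string format, the same steps can be wrapped in a helper that parses the column first. The name latest_per_item and its parameters are hypothetical, not part of pandas:

```python
import pandas as pd

def latest_per_item(df, key="item", ts="timestamp"):
    """Keep the row with the latest timestamp for each key (hypothetical helper)."""
    out = df.copy()                    # leave the caller's frame untouched
    out[ts] = pd.to_datetime(out[ts])  # parse strings so the sort is chronological
    return (out.sort_values([key, ts])
               .drop_duplicates(key, keep="last")
               .reset_index(drop=True))

x = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37", "2020-06-07 11:09:18",
                  "2020-06-07 11:04:40", "2020-06-07 11:09:11", "2020-06-07 11:09:23"],
})

result = latest_per_item(x)
print(result)
```

The helper works on a copy, so the original timestamp column keeps its string dtype.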

Another way:

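x['timestamp'] = pd.to_datetime(x['timestamp'])  # coerce timestamp to datetime
x.set_index('timestamp', inplace=True)  # set timestamp as the index
# group by calendar date and item, keeping the last vote in row order
x2 = x.groupby([x.index.date, x['item']])['vote'].agg(vote='last').reset_index()
x2.columns = ['timestamp', 'item', 'vote']  # the timestamp column now holds only the date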
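
The groupby approach above keeps only the calendar date, and 'last' follows row order rather than the timestamp value. As a sketch of a variant that keeps the full timestamp, one could look up the index label of the maximum timestamp per item with idxmax and select those rows with loc. This assumes a fresh copy of the original frame with a unique index; the names df, latest_idx and latest are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37", "2020-06-07 11:09:18",
                  "2020-06-07 11:04:40", "2020-06-07 11:09:11", "2020-06-07 11:09:23"],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])  # compare as datetimes, not strings

# For each item, find the index label of the row with the latest timestamp.
latest_idx = df.groupby("item")["timestamp"].idxmax()

# Select those rows; no sorting of the whole frame is needed.
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```

Because idxmax picks rows by value, the result does not depend on the order of the rows in the input.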
