Python pandas: Compare rows of dataframe based on some columns and drop row with lowest value
I have a dataframe df:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
1 2015-05-11 23:08:46 2015-05-11 23:08:46 http://11i-ssaintandder.com/
2 2015-05-02 18:27:10 2015-06-06 03:52:03 http://goo.gl/NMqjd1
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
I want to drop the rows that share the same "first_seen" and "uri", keeping only the row with the latest last_seen.
Here is an example of the expected dataset:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
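Does anybody know how to do this without writing a for loop?
Call drop_duplicates, pass the columns to consider for duplicate matching as the subset argument, and set keep='last' (old pandas releases spelled this take_last=True):
In [295]:
df.drop_duplicates(subset=['first_seen','uri'], keep='last')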
Out[295]:
index first_seen last_seen uri
1 1 2015-05-11 23:08:46 2015-05-11 23:08:46 http://11i-ssaintandder.com/
3 3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
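Edit
The call above keeps the last row by position, not the row with the largest last_seen. To get the latest date, you need to sort df on "first_seen" and "last_seen" first:
In [317]:
df = df.sort_values(by=['first_seen','last_seen'], ascending=[False, True])
df.drop_duplicates(subset=['first_seen','uri'], keep='last')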
Out[317]:
index first_seen last_seen uri
0 0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
3 3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
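For anyone who wants to try the sort-then-deduplicate approach end to end, here is a minimal self-contained sketch. It assumes a recent pandas release (sort_values and keep='last' instead of the older sort and take_last), and it rebuilds the question's four rows by hand; the variable name latest is only illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question (values copied from the post).
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46", "2015-05-11 23:08:46",
        "2015-05-02 18:27:10", "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50", "2015-05-11 23:08:46",
        "2015-06-06 03:52:03", "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/", "http://11i-ssaintandder.com/",
        "http://goo.gl/NMqjd1", "http://goo.gl/NMqjd1",
    ],
})

# Sort so that, within each (first_seen, uri) group, the row with the
# largest last_seen comes last; keep="last" then retains exactly that row.
latest = (
    df.sort_values(by=["first_seen", "last_seen"], ascending=[False, True])
      .drop_duplicates(subset=["first_seen", "uri"], keep="last")
)

print(latest)
```

The surviving rows keep their original index labels (0 and 3 here), so they can be matched back to the input frame.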