Python pandas: Compare rows of dataframe based on some columns and drop row with lowest value
I have a dataframe df:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
1 2015-05-11 23:08:46 2015-05-11 23:08:46 http://11i-ssaintandder.com/
2 2015-05-02 18:27:10 2015-06-06 03:52:03 http://goo.gl/NMqjd1
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
I want to drop rows that share the same 'first_seen' and 'uri', keeping only the row with the latest last_seen.
Here is an example of the expected dataset:
first_seen last_seen uri
0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
Does anyone know how to do this without writing a for loop?
Call drop_duplicates, pass the columns to consider for duplicate matching as the subset argument, and set keep='last' (written take_last=True in old pandas releases):
In [295]:
df.drop_duplicates(subset=['first_seen','uri'], keep='last')
Out[295]:
index first_seen last_seen uri
1 1 2015-05-11 23:08:46 2015-05-11 23:08:46 http://11i-ssaintandder.com/
3 3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
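Note that keeping the "last" duplicate is purely positional: it retains whichever matching row appears last in the frame, not the one with the largest last_seen, which is why row 1 rather than row 0 survives above. A minimal sketch of that behaviour, assuming a current pandas where the option is spelled keep='last' (the two-row frame and the names demo and positional are made up for illustration):

```python
import pandas as pd

# Hypothetical two-row frame: the row with the later last_seen comes first.
demo = pd.DataFrame({
    "first_seen": pd.to_datetime(["2015-05-11 23:08:46", "2015-05-11 23:08:46"]),
    "last_seen": pd.to_datetime(["2015-05-11 23:08:50", "2015-05-11 23:08:46"]),
    "uri": ["http://11i-ssaintandder.com/"] * 2,
})

# keep="last" retains the row that is last by position (index 1),
# even though index 0 carries the later last_seen.
positional = demo.drop_duplicates(subset=["first_seen", "uri"], keep="last")
```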
Edit
To get the latest date, you first need to sort df on 'first_seen' and 'last_seen' (sort_values is the current name for the older df.sort):
In [317]:
df = df.sort_values(by=['first_seen','last_seen'], ascending=[False, True])
df.drop_duplicates(subset=['first_seen','uri'], keep='last')
Out[317]:
index first_seen last_seen uri
0 0 2015-05-11 23:08:46 2015-05-11 23:08:50 http://11i-ssaintandder.com/
3 3 2015-05-02 18:27:10 2015-06-08 08:44:53 http://goo.gl/NMqjd1
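For reference, the sort-then-deduplicate approach can be sketched end to end on the question's data. This assumes a recent pandas, where sort_values and keep='last' are the available spellings; the names ordered and latest are only illustrative, and the frame is rebuilt by hand from the rows shown in the question.

```python
import pandas as pd

# Rebuild the question's frame (timestamps parsed so they sort chronologically).
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46", "2015-05-11 23:08:46",
        "2015-05-02 18:27:10", "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50", "2015-05-11 23:08:46",
        "2015-06-06 03:52:03", "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/", "http://11i-ssaintandder.com/",
        "http://goo.gl/NMqjd1", "http://goo.gl/NMqjd1",
    ],
})

# Sort so that, within each (first_seen, uri) group, the latest last_seen comes last.
ordered = df.sort_values(by=["first_seen", "last_seen"], ascending=[False, True])

# Keep only the final row of each group, i.e. the one with the latest last_seen.
latest = ordered.drop_duplicates(subset=["first_seen", "uri"], keep="last")
```

The sort matters only for the order within each group; sorting last_seen ascending is what makes the positional "last" row coincide with the most recent one.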