Sort by columns and only keep the first line until next value in column 1
I have a file with roughly 10 million lines. Each line is most likely unique, but I'm sorting the file by column 1, then 2, then 3.
Column 1 = CODE
Column 2 = DATE
Column 3 = AMOUNT
I only want to keep the first line until the next date, and so on. Below is an example of what I have and what I need the output to be.
Original:
COL1 COL2 COL3
ABA 2019-01-01 100
ABA 2019-01-01 111
ABA 2019-01-02 140
ABA 2019-01-02 150
ABA 2019-01-03 200
ABA 2019-01-03 220
Output needed:
COL1 COL2 COL3
ABA 2019-01-01 100
ABA 2019-01-02 140
ABA 2019-01-03 200
Is anyone able to help me? I have tried:
a.drop_duplicates(subset[data.columns[0],data.columns[1],data.columns[2]], keep='first')
I also tried groupby followed by first:
a.groupby([data.columns[0],data.columns[1]], as_index=False).first()
Your solution is almost correct. Here is a modified version:
>> a.drop_duplicates(subset = [a.columns[0],a.columns[1]], keep='first')
That produces:
COL1 COL2 COL3
0 ABA 2019-01-01 100
2 ABA 2019-01-02 140
4 ABA 2019-01-03 200
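Explaining the modifications: subset is passed as a keyword argument (subset=[...]), the column names are taken from a itself rather than from data, and only the first two columns (CODE and DATE) are listed. Rows that share a code and a date are then treated as duplicates, and keep='first' retains the first of them.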
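To check this end to end, here is a minimal, self-contained sketch built on the sample data from the question. The in-memory string, the whitespace separator, and the variable names raw, result and grouped are assumptions made for illustration; the real file would be read from disk with whatever delimiter it actually uses.

```python
import io
import pandas as pd

# Sample rows copied from the question (assumed whitespace-separated).
raw = """COL1 COL2 COL3
ABA 2019-01-01 100
ABA 2019-01-01 111
ABA 2019-01-02 140
ABA 2019-01-02 150
ABA 2019-01-03 200
ABA 2019-01-03 220
"""

a = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Sort by code, then date, then amount, as described in the question.
a = a.sort_values(list(a.columns[:3]))

# Keep only the first row for each (COL1, COL2) pair.
result = a.drop_duplicates(subset=[a.columns[0], a.columns[1]], keep="first")
print(result)

# The groupby attempt from the question gives the same values here,
# but with a fresh 0..n-1 index; note that first() takes the first
# non-null value per column, which matters if the data has gaps.
grouped = a.groupby([a.columns[0], a.columns[1]], as_index=False).first()
print(grouped)
```

With drop_duplicates the original row labels (0, 2, 4) are preserved; call reset_index(drop=True) on the result if a consecutive index is preferred.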