Sort by column, keep only the first row until the next value in column 1

Question

I have a file with roughly 10 million lines. Each line is most likely unique, but I'm sorting the file by column 1, then column 2, then column 3.

Column 1 = CODE
Column 2 = DATE
Column 3 = AMOUNT

For each code, I only want to keep the first line for each date, dropping the rest until the next date begins. Below is an example of what I have and what I need the output to be.

Original:  
COL1   COL2         COL3  
ABA    2019-01-01   100  
ABA    2019-01-01   111  
ABA    2019-01-02   140  
ABA    2019-01-02   150  
ABA    2019-01-03   200  
ABA    2019-01-03   220  

Output needed:  
COL1   COL2         COL3  
ABA    2019-01-01   100  
ABA    2019-01-02   140  
ABA    2019-01-03   200

Can anyone help me? I have tried:

a.drop_duplicates(subset[data.columns[0],data.columns[1],data.columns[2]], keep='first')

Answer 1

Try groupby followed by first:

a.groupby([a.columns[0], a.columns[1]], as_index=False).first()
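As a minimal sketch of the groupby approach, here it is applied to the sample data from the question. The DataFrame is rebuilt by hand for illustration, and `first_rows` is a name chosen here, not something from the original post.

```python
import pandas as pd

# Sample data copied from the question (built by hand for illustration).
a = pd.DataFrame({
    "COL1": ["ABA"] * 6,
    "COL2": ["2019-01-01", "2019-01-01", "2019-01-02",
             "2019-01-02", "2019-01-03", "2019-01-03"],
    "COL3": [100, 111, 140, 150, 200, 220],
})

# Group on the first two columns and take the first entry of each group.
# as_index=False keeps COL1 and COL2 as ordinary columns in the result.
first_rows = a.groupby([a.columns[0], a.columns[1]], as_index=False).first()
print(first_rows)
```

Two things to keep in mind: `first()` returns the first non-null value of each column within a group, which can differ from the first physical row if the data contains missing values, and `groupby` sorts the group keys by default.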

Answer 2

Your solution is almost correct. Here is a modified version:

>> a.drop_duplicates(subset = [a.columns[0],a.columns[1]], keep='first')

That produces:

    COL1    COL2        COL3
0   ABA     2019-01-01  100
2   ABA     2019-01-02  140
4   ABA     2019-01-03  200

Explaining the modifications:

subset is a named parameter, as you can see in the documentation of drop_duplicates, so it must be passed as subset=[...];
if column 3 can vary, it should not be included in the subset parameter; duplicates should be detected on the first two columns only;
the names used in your code are not consistent: a and data apparently refer to the same object.
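Putting the points above together with the sort described in the question, the whole pipeline could look roughly like the sketch below. The rows are deliberately shuffled to show why sorting has to come first; `ordered` and `result` are names chosen here for illustration, and the dates are assumed to be ISO-formatted strings, which sort correctly as plain text.

```python
import pandas as pd

# The question's sample rows in shuffled order (built by hand for illustration).
a = pd.DataFrame({
    "COL1": ["ABA", "ABA", "ABA", "ABA", "ABA", "ABA"],
    "COL2": ["2019-01-02", "2019-01-01", "2019-01-03",
             "2019-01-01", "2019-01-03", "2019-01-02"],
    "COL3": [150, 111, 200, 100, 220, 140],
})

# Sort by column 1, then 2, then 3, so the smallest amount comes first
# within each (code, date) pair.
ordered = a.sort_values(list(a.columns[:3]))

# Detect duplicates on the first two columns only and keep the first row of each.
result = ordered.drop_duplicates(subset=[a.columns[0], a.columns[1]], keep="first")

# Optional: renumber the rows, since drop_duplicates keeps the original index labels.
result = result.reset_index(drop=True)
print(result)
```

Without the `reset_index` call, the result keeps the index labels of the surviving rows, which is why the output shown in the answer is numbered 0, 2, 4.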
