Sort by columns and only keep the first line until the next value in column 1

I have a file with roughly 10 million lines. Each line is most likely unique, but I'm sorting the file by column 1, then column 2, then column 3.

Column 1 = CODE
Column 2 = DATE
Column 3 = AMOUNT

For each code, I only want to keep the first line for each date. Below is an example of what I have and what I need the output to be.

Original:  
COL1   COL2         COL3  
ABA    2019-01-01   100  
ABA    2019-01-01   111  
ABA    2019-01-02   140  
ABA    2019-01-02   150  
ABA    2019-01-03   200  
ABA    2019-01-03   220  

Output needed:  
COL1   COL2         COL3  
ABA    2019-01-01   100  
ABA    2019-01-02   140  
ABA    2019-01-03   200  

Is anyone able to help me? I have tried:

a.drop_duplicates(subset[data.columns[0],data.columns[1],data.columns[2]], keep='first')

I also tried groupby followed by first:

a.groupby([data.columns[0],data.columns[1]], as_index=False).first()

Your solution is almost correct. Here is a modified version:

>>> a.drop_duplicates(subset=[a.columns[0], a.columns[1]], keep='first')

That produces:

    COL1    COL2        COL3
0   ABA     2019-01-01  100
2   ABA     2019-01-02  140
4   ABA     2019-01-03  200

Explaining the modifications:

  1. subset is a named (keyword) parameter, so it must be passed as subset=[...], as you can see in the documentation of drop_duplicates;
  2. if column 3 can vary, it shouldn't be included in the subset parameter; duplicates should be identified by the first 2 columns only;
  3. the names you used in the code are not consistent: a and data apparently refer to the same object.


