
The optimal way to remove duplicates from a list of sorted very large files (200G each)?

  • Other previously asked questions did not answer my question!

I have a series of large files (200 G each); each file is sorted and contains duplicates that look like this:

 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100| b.ac
 50.21.180.100| b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100| c.ac
 50.21.180.100| c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100| d.ac

Expected output:

50.21.180.100|a.ac
50.21.180.100|b.ac
50.21.180.100|c.ac
50.21.180.100|d.ac

Does anybody have a suggestion for the most optimal way (time- and memory-wise) to remove these duplicates, whether with Linux bash, Python, or another language?

First remove the spaces, then run uniq:

tr -d " " < infile.txt | uniq > outfile.txt
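
The question mentions a list of sorted files rather than a single one, so here is a minimal sketch along the same lines (the file names part1.txt, part2.txt, part3.txt and merged.txt are placeholders, and GNU coreutils is assumed): clean each file as above, then merge the already-sorted results with sort -m, letting -u drop duplicates that cross file boundaries.

# Placeholder file names; assumes GNU coreutils and that every input file is already sorted.
for f in part1.txt part2.txt part3.txt; do
    tr -d " " < "$f" | uniq > "$f.dedup"    # strip stray spaces, collapse adjacent duplicates
done
# -m merges already-sorted inputs without a full re-sort; -u keeps only unique lines.
# LC_ALL=C assumes byte-wise sort order; drop it if the files were sorted under another locale.
LC_ALL=C sort -m -u part1.txt.dedup part2.txt.dedup part3.txt.dedup > merged.txt

Both uniq and sort -m process their input as a stream, so memory use stays small even for 200 G files.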
