
What is the optimal way to remove duplicates from sorted, very large files (200 GB each)?

  • Other previously asked questions did not answer my question!

I have a series of large files (200 GB each). Each file is sorted and contains duplicates that look like this:

 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100| b.ac
 50.21.180.100| b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100| c.ac
 50.21.180.100| c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100| d.ac

Expected output:

50.21.180.100|a.ac
50.21.180.100|b.ac
50.21.180.100|c.ac
50.21.180.100|d.ac

Does anybody have a suggestion for the most efficient way (time- and memory-wise) to remove these duplicates, whether with Linux bash, Python, or another language?

First remove the spaces, then run uniq:

tr -d " " < infile.txt | uniq > outfile.txt
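Note that uniq only collapses adjacent duplicate lines, which is exactly what you need here because the files are already sorted. If you prefer Python, below is a minimal sketch of the same idea (the script name and file names are placeholders): it streams a file line by line, strips stray spaces so "| b.ac" and "|b.ac" compare equal, and keeps only the previous normalized line in memory.

import sys

def dedupe(infile, outfile):
    """Copy infile to outfile, dropping consecutive duplicate lines."""
    prev = None
    with open(infile, "r") as src, open(outfile, "w") as dst:
        for line in src:
            # Normalize stray spaces so duplicates compare equal.
            normalized = line.replace(" ", "")
            if normalized != prev:
                dst.write(normalized)
                prev = normalized

if __name__ == "__main__":
    dedupe(sys.argv[1], sys.argv[2])

Usage: python3 dedupe.py infile.txt outfile.txt

Because the input is sorted, there is no need to keep a set of all seen lines; comparing against the previous line is enough, so memory use stays constant no matter how large the file is.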
