
What is the optimal way to remove duplicates from sorted, very large files (200 GB each)?

  • Other previously asked questions did not answer my question!

I have a series of large files (200 GB each). Each file is sorted and contains duplicates that look like this:

 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100| b.ac
 50.21.180.100| b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100| c.ac
 50.21.180.100| c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100| d.ac

Expected output:

50.21.180.100|a.ac
50.21.180.100|b.ac
50.21.180.100|c.ac
50.21.180.100|d.ac

Does anybody have a suggestion for the most efficient way (time- and memory-wise) to remove these duplicates, whether with Linux bash, Python, or another language?

First remove the spaces, then run uniq:

tr -d " " < infile.txt | uniq > outfile.txt
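Note that uniq only collapses adjacent duplicate lines, which is exactly what you need here because the files are already sorted. If you prefer Python, below is a minimal sketch of the same idea (the script name and file names are placeholders): it streams a file line by line, strips stray spaces so "| b.ac" and "|b.ac" compare equal, and keeps only the previous normalized line in memory.

import sys

def dedupe(infile, outfile):
    """Copy infile to outfile, dropping consecutive duplicate lines."""
    prev = None
    with open(infile, "r") as src, open(outfile, "w") as dst:
        for line in src:
            # Normalize stray spaces so duplicates compare equal.
            normalized = line.replace(" ", "")
            if normalized != prev:
                dst.write(normalized)
                prev = normalized

if __name__ == "__main__":
    dedupe(sys.argv[1], sys.argv[2])

Usage: python3 dedupe.py infile.txt outfile.txt

Because the input is sorted, there is no need to keep a set of all seen lines; comparing against the previous line is enough, so memory use stays constant no matter how large the file is.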
