
The optimal way to remove duplicates from a list of sorted very large files (200G each)?

  • Other previously asked questions did not answer my question!

I have a series of large files (200 G each); each file is sorted and contains duplicates that look like this:

 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100|a.ac
 50.21.180.100| b.ac
 50.21.180.100| b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100|b.ac
 50.21.180.100| c.ac
 50.21.180.100| c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100|c.ac
 50.21.180.100| d.ac

Expected output:

50.21.180.100|a.ac
50.21.180.100|b.ac
50.21.180.100|c.ac
50.21.180.100|d.ac

Does anybody have a suggestion for the most optimal way (time- and memory-wise) to remove these duplicates, whether with Linux bash, Python, or another language?

First remove the spaces, then run uniq:

tr -d " " < infile.txt | uniq > outfile.txt
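
The question mentions a list of sorted files rather than a single one, so here is a minimal sketch along the same lines (the file names part1.txt, part2.txt, part3.txt and merged.txt are placeholders, and GNU coreutils is assumed): clean each file as above, then merge the already-sorted results with sort -m, letting -u drop duplicates that cross file boundaries.

# Placeholder file names; assumes GNU coreutils and that every input file is already sorted.
for f in part1.txt part2.txt part3.txt; do
    tr -d " " < "$f" | uniq > "$f.dedup"    # strip stray spaces, collapse adjacent duplicates
done
# -m merges already-sorted inputs without a full re-sort; -u keeps only unique lines.
# LC_ALL=C assumes byte-wise sort order; drop it if the files were sorted under another locale.
LC_ALL=C sort -m -u part1.txt.dedup part2.txt.dedup part3.txt.dedup > merged.txt

Both uniq and sort -m process their input as a stream, so memory use stays small even for 200 G files.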
