
How to remove duplicates from a file?

How to remove duplicates from a large file of large numbers ? 如何从大量的大文件中删除重复项? This is an interview question about algorithms and data structures rather than sort -u and stuff like that. 这是关于算法和数据结构的访谈问题,而不是关于sort -u东西。

I assume that the file does not fit in memory and that the range of the numbers is large enough, so I cannot use an in-memory count/bucket sort.

The only option I see is to sort the file (e.g. merge sort) and then pass over the sorted file again to filter out duplicates.

Does that make sense? Are there other options?
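For concreteness, here is a minimal sketch of that approach in Python, assuming the numbers are stored one per line (the chunk size and file paths are illustrative): sort chunks that fit in memory, write them out as sorted runs, k-way merge the runs into one sorted file, then make a second pass that drops adjacent duplicates.

```python
import heapq
import itertools
import tempfile

def external_sort(in_path, sorted_path, chunk_size=1_000_000):
    """Sort a file of integers (one per line) that does not fit in memory:
    sort fixed-size chunks, write them out as runs, then k-way merge the runs."""
    runs = []
    with open(in_path) as f:
        while True:
            chunk = sorted(map(int, itertools.islice(f, chunk_size)))
            if not chunk:
                break
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(f"{x}\n" for x in chunk)
            run.seek(0)
            runs.append(run)
    with open(sorted_path, "w") as out:
        merged = heapq.merge(*(map(int, r) for r in runs))
        out.writelines(f"{x}\n" for x in merged)

def drop_adjacent_duplicates(sorted_path, out_path):
    """Second pass over the sorted file: duplicates are now adjacent,
    so keep a line only if it differs from the previous one."""
    prev = None
    with open(sorted_path) as f, open(out_path, "w") as out:
        for line in f:
            if line != prev:
                out.write(line)
                prev = line
```

Calling external_sort followed by drop_adjacent_duplicates is exactly the "sort, then filter" plan described above.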

You won't even need a separate pass over the sorted data if you use a duplicate-removing variant of "merge" (aka "union") in your mergesort. A hash table would have to be empty-ish to perform well, i.e. be even bigger than the file itself - and we're told that the file itself is big.
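A sketch of that duplicate-removing ("union") merge, here expressed with Python's heapq.merge over already-sorted runs: equal values come out of the merge adjacent to each other, so writing a value only when it differs from the last one written removes duplicates in the same pass.

```python
import heapq

def merge_unique(runs, out_path):
    """'Union' merge: k-way merge of sorted runs that writes each value once.
    `runs` is any list of sorted iterables of ints (e.g. the run files above)."""
    prev = None
    with open(out_path, "w") as out:
        for x in heapq.merge(*runs):
            if x != prev:          # equal values are adjacent in merged order
                out.write(f"{x}\n")
                prev = x
```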

Look up multi-way merge (e.g. here) and external sorting.

Yes, the solution makes sense.

An alternative is to build a file-system-based hash table and maintain it as a set. First iterate over all elements and insert them into your set, then in a second iteration print all elements in the set.
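One way to realize such a file-system-based set, sketched below under the assumption that each hash bucket is small enough to fit in memory (the bucket count and directory name are illustrative): equal values always hash to the same bucket file, so each bucket can be deduplicated independently.

```python
import os

def dedupe_via_disk_buckets(in_path, out_path, n_buckets=1024, tmp_dir="buckets"):
    """Partition the values into bucket files by hash, then deduplicate each
    bucket with an in-memory set and append its unique values to the output."""
    os.makedirs(tmp_dir, exist_ok=True)
    buckets = [open(os.path.join(tmp_dir, f"bucket_{i}.txt"), "w")
               for i in range(n_buckets)]
    with open(in_path) as f:
        for line in f:
            x = int(line)
            buckets[hash(x) % n_buckets].write(f"{x}\n")
    for b in buckets:
        b.close()

    with open(out_path, "w") as out:
        for i in range(n_buckets):
            with open(os.path.join(tmp_dir, f"bucket_{i}.txt")) as b:
                unique = {int(line) for line in b}   # one bucket fits in memory
            out.writelines(f"{x}\n" for x in unique)
```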

Which will perform better is implementation- and data-dependent. In terms of big-O complexity, the hash offers O(n) average-case time and O(n^2) worst case, while the merge sort option offers a more stable O(n log n) solution.

Mergesort or Timsort (which is an improved mergesort) is a good idea. E.g.: http://stromberg.dnsalias.org/~strombrg/sort-comparison/

You might also be able to get some mileage out of a Bloom filter. It's a probabilistic data structure with low memory requirements, and you can adjust the error probability. E.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ You could use one to toss out values that are definitely unique, and then scrutinize the values that are probably not unique more closely via some other method. This would be especially valuable if your input dataset has a lot of duplicates. It doesn't require comparing elements directly; it just hashes the elements using a potentially large number of hash functions.
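A minimal Bloom filter sketch along those lines (the bit-array size and hash scheme are illustrative, not tuned): the first pass emits values the filter has definitely never seen and diverts possible repeats to a smaller suspects file, which then needs an exact check by one of the other methods.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Membership tests can give false positives but never false negatives."""
    def __init__(self, m_bits=8_000_000, k=5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, value):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

def split_by_bloom(in_path, unique_path, suspects_path):
    """First pass: values the filter has definitely not seen are first
    occurrences and can be emitted; possible repeats (including false
    positives) go to a smaller suspects file for an exact second check."""
    bf = BloomFilter()
    with open(in_path) as f, open(unique_path, "w") as uniq, \
         open(suspects_path, "w") as sus:
        for line in f:
            x = line.strip()
            if x in bf:
                sus.write(line)      # maybe a duplicate, maybe a false positive
            else:
                uniq.write(line)     # definitely the first occurrence
            bf.add(x)
```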

You could also use an on-disk BTree or 2-3 Tree or similar. These are often stored on disk, and keep key/value pairs in key order.
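As a rough illustration of the on-disk B-tree idea, SQLite (whose tables and indexes are stored as B-trees) can stand in for the key-ordered on-disk structure; this sketch assumes the numbers fit in an INTEGER column and that paths are placeholders.

```python
import sqlite3

def dedupe_with_sqlite(in_path, out_path, db_path="numbers.db"):
    """Insert every number into an on-disk B-tree keyed table; the PRIMARY KEY
    rejects duplicates, and the final SELECT walks the keys in sorted order."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS nums (n INTEGER PRIMARY KEY)")
    with open(in_path) as f:
        con.executemany("INSERT OR IGNORE INTO nums VALUES (?)",
                        ((int(line),) for line in f))
    con.commit()
    with open(out_path, "w") as out:
        out.writelines(f"{n}\n"
                       for (n,) in con.execute("SELECT n FROM nums ORDER BY n"))
    con.close()
```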
