
How to remove duplicates from a file?

How to remove duplicates from a large file of large numbers ? 如何从大量的大文件中删除重复项? This is an interview question about algorithms and data structures rather than sort -u and stuff like that. 这是关于算法和数据结构的访谈问题,而不是关于sort -u东西。

I assume that the file does not fit in memory and that the range of the numbers is large enough, so I cannot use an in-memory count/bucket sort.

The only option I see is to sort the file (e.g. merge sort) and then pass over the sorted file again to filter out duplicates.

Does that make sense? Are there other options?
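For concreteness, here is a minimal sketch of that approach in Python, assuming the numbers are stored one per line (the chunk size and file paths are illustrative): sort chunks that fit in memory, write them out as sorted runs, k-way merge the runs into one sorted file, then make a second pass that drops adjacent duplicates.

```python
import heapq
import itertools
import tempfile

def external_sort(in_path, sorted_path, chunk_size=1_000_000):
    """Sort a file of integers (one per line) that does not fit in memory:
    sort fixed-size chunks, write them out as runs, then k-way merge the runs."""
    runs = []
    with open(in_path) as f:
        while True:
            chunk = sorted(map(int, itertools.islice(f, chunk_size)))
            if not chunk:
                break
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(f"{x}\n" for x in chunk)
            run.seek(0)
            runs.append(run)
    with open(sorted_path, "w") as out:
        merged = heapq.merge(*(map(int, r) for r in runs))
        out.writelines(f"{x}\n" for x in merged)

def drop_adjacent_duplicates(sorted_path, out_path):
    """Second pass over the sorted file: duplicates are now adjacent,
    so keep a line only if it differs from the previous one."""
    prev = None
    with open(sorted_path) as f, open(out_path, "w") as out:
        for line in f:
            if line != prev:
                out.write(line)
                prev = line
```

Calling external_sort followed by drop_adjacent_duplicates is exactly the "sort, then filter" plan described above.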

You won't even need a separate pass over the sorted data if you use a duplicate-removing variant of "merge" (aka "union") in your mergesort. A hash table would have to be empty-ish to perform well, i.e. be even bigger than the file itself - and we're told that the file itself is big.
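A sketch of that duplicate-removing ("union") merge, here expressed with Python's heapq.merge over already-sorted runs: equal values come out of the merge adjacent to each other, so writing a value only when it differs from the last one written removes duplicates in the same pass.

```python
import heapq

def merge_unique(runs, out_path):
    """'Union' merge: k-way merge of sorted runs that writes each value once.
    `runs` is any list of sorted iterables of ints (e.g. the run files above)."""
    prev = None
    with open(out_path, "w") as out:
        for x in heapq.merge(*runs):
            if x != prev:          # equal values are adjacent in merged order
                out.write(f"{x}\n")
                prev = x
```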

Look up multi-way merge (e.g. here) and external sorting.

Yes, the solution makes sense.

An alternative is to build a file-system-based hash table and maintain it as a set. First iterate over all elements and insert them into your set, then in a second iteration print all elements in the set.
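One way to realize such a file-system-based set, sketched below under the assumption that each hash bucket is small enough to fit in memory (the bucket count and directory name are illustrative): equal values always hash to the same bucket file, so each bucket can be deduplicated independently.

```python
import os

def dedupe_via_disk_buckets(in_path, out_path, n_buckets=1024, tmp_dir="buckets"):
    """Partition the values into bucket files by hash, then deduplicate each
    bucket with an in-memory set and append its unique values to the output."""
    os.makedirs(tmp_dir, exist_ok=True)
    buckets = [open(os.path.join(tmp_dir, f"bucket_{i}.txt"), "w")
               for i in range(n_buckets)]
    with open(in_path) as f:
        for line in f:
            x = int(line)
            buckets[hash(x) % n_buckets].write(f"{x}\n")
    for b in buckets:
        b.close()

    with open(out_path, "w") as out:
        for i in range(n_buckets):
            with open(os.path.join(tmp_dir, f"bucket_{i}.txt")) as b:
                unique = {int(line) for line in b}   # one bucket fits in memory
            out.writelines(f"{x}\n" for x in unique)
```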

Which will perform better is implementation- and data-dependent. In terms of big-O complexity, the hash offers O(n) average-case time and O(n^2) worst case, while the merge sort option offers a more stable O(n log n) solution.

Mergesort or Timsort (which is an improved mergesort) is a good idea. E.g.: http://stromberg.dnsalias.org/~strombrg/sort-comparison/

You might also be able to get some mileage out of a Bloom filter. It's a probabilistic data structure with low memory requirements, and you can adjust the error probability. E.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ You could use one to toss out values that are definitely unique, and then scrutinize the values that are probably not unique more closely via some other method. This would be especially valuable if your input dataset has a lot of duplicates. It doesn't require comparing elements directly; it just hashes the elements using a potentially large number of hash functions.
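A minimal Bloom filter sketch along those lines (the bit-array size and hash scheme are illustrative, not tuned): the first pass emits values the filter has definitely never seen and diverts possible repeats to a smaller suspects file, which then needs an exact check by one of the other methods.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Membership tests can give false positives but never false negatives."""
    def __init__(self, m_bits=8_000_000, k=5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, value):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

def split_by_bloom(in_path, unique_path, suspects_path):
    """First pass: values the filter has definitely not seen are first
    occurrences and can be emitted; possible repeats (including false
    positives) go to a smaller suspects file for an exact second check."""
    bf = BloomFilter()
    with open(in_path) as f, open(unique_path, "w") as uniq, \
         open(suspects_path, "w") as sus:
        for line in f:
            x = line.strip()
            if x in bf:
                sus.write(line)      # maybe a duplicate, maybe a false positive
            else:
                uniq.write(line)     # definitely the first occurrence
            bf.add(x)
```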

You could also use an on-disk BTree or 2-3 Tree or similar. These are often stored on disk, and keep key/value pairs in key order.
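As a rough illustration of the on-disk B-tree idea, SQLite (whose tables and indexes are stored as B-trees) can stand in for the key-ordered on-disk structure; this sketch assumes the numbers fit in an INTEGER column and that paths are placeholders.

```python
import sqlite3

def dedupe_with_sqlite(in_path, out_path, db_path="numbers.db"):
    """Insert every number into an on-disk B-tree keyed table; the PRIMARY KEY
    rejects duplicates, and the final SELECT walks the keys in sorted order."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS nums (n INTEGER PRIMARY KEY)")
    with open(in_path) as f:
        con.executemany("INSERT OR IGNORE INTO nums VALUES (?)",
                        ((int(line),) for line in f))
    con.commit()
    with open(out_path, "w") as out:
        out.writelines(f"{n}\n"
                       for (n,) in con.execute("SELECT n FROM nums ORDER BY n"))
    con.close()
```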
