Bloom filter to remove duplicates from a stream of integers in O(n)

How can I create a Bloom filter that removes duplicate elements from a stream of integers in O(n) time and O(1) space? I would appreciate it if someone could point me in the right direction.

I'm fairly certain it's just this:

For each element:

  • Check whether it exists in the Bloom filter. If it does, it's likely a duplicate, so skip it.
  • Otherwise, insert it into the Bloom filter.
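
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the `BloomFilter` class, the `dedupe` helper, the default sizes, and the use of SHA-256 with double hashing are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size Bloom filter over integers (illustrative sketch)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: int) -> Iterator[int]:
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def might_contain(self, value: int) -> bool:
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))

    def add(self, value: int) -> None:
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)


def dedupe(stream: Iterable[int], bloom: BloomFilter) -> Iterator[int]:
    """Yield each value the filter has not (probably) seen before."""
    for x in stream:
        if bloom.might_contain(x):
            continue  # probably a duplicate, but possibly a false positive
        bloom.add(x)
        yield x
```

With a filter this large relative to the input, `list(dedupe([3, 1, 3, 2, 1, 3], BloomFilter()))` should give `[3, 1, 2]`, barring an extremely unlikely false positive. Each element costs a constant number of hash and bit operations, which is where the O(n) total time comes from.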

Now there are two problems with this:

  • There is a probability of false positives, so some elements that are not duplicates will be dropped.
  • It's not truly O(1) space (though some people may say it is), because the size of the filter needs to depend on the number of unique elements. With a fixed size, the error rate increases significantly as the number of elements grows.
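
To make the space point concrete, here is a small sketch based on the standard Bloom filter approximations, which assume independent, uniformly distributed hash functions. The function names `bloom_parameters` and `false_positive_rate` are made up for this example; n is the number of distinct elements, m the number of bits, k the number of hash functions, and p the target false-positive rate.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): bits and hash functions for n distinct items at rate p."""
    # Standard sizing formulas: m = -n ln p / (ln 2)^2, k = (m / n) ln 2.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false-positive rate after inserting n distinct items."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

For n = 1,000,000 and p = 0.01 this gives roughly 9.6 million bits (about 1.2 MB) and k = 7. If that same filter then receives ten times as many distinct values, the estimated false-positive rate rises above 0.99, which is why a fixed-size filter cannot hold a fixed error rate on an unbounded stream.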

I don't believe either of these problems can be avoided given the constraints; both are inherent to using (only) Bloom filters.

If we weren't dealing with a stream, but rather a list, we could get rid of the false positives by recording every element the Bloom filter flags as a possible duplicate, then going through the list a second time and checking against that candidate list to make sure the flagged elements are actual duplicates. This is still O(n) time, but obviously not O(1) space.
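
One way the two-pass idea could look in Python, as a sketch under assumptions: `TinyBloom` is a deliberately compact stand-in filter (an integer used as a bit set, SHA-256 with double hashing), and `exact_dedupe` is a hypothetical helper name. Any element that occurs more than once is guaranteed to land in the candidate set, so elements outside it can be emitted without further checks.

```python
import hashlib
from typing import Sequence


class TinyBloom:
    """Compact illustrative Bloom filter; a Python int serves as the bit set."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.m, self.k, self.bits = num_bits, num_hashes, 0

    def _mask(self, value: int) -> int:
        d = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        mask = 0
        for i in range(self.k):
            mask |= 1 << ((h1 + i * h2) % self.m)
        return mask

    def might_contain(self, value: int) -> bool:
        mask = self._mask(value)
        return self.bits & mask == mask

    def add(self, value: int) -> None:
        self.bits |= self._mask(value)


def exact_dedupe(items: Sequence[int], bloom: TinyBloom) -> list[int]:
    # Pass 1: record every value the filter flags as "seen before".
    # These are true duplicates plus any false positives.
    candidates = set()
    for x in items:
        if bloom.might_contain(x):
            candidates.add(x)
        else:
            bloom.add(x)

    # Pass 2: values outside `candidates` occur exactly once, so emit them
    # directly; candidates are deduplicated exactly with a small set.
    emitted = set()
    result = []
    for x in items:
        if x in candidates:
            if x in emitted:
                continue
            emitted.add(x)
        result.append(x)
    return result
```

The extra memory is proportional to the number of candidates (real duplicates plus false positives) rather than to the whole input, but it still grows with the data, so this is not O(1) space.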

Google Guava offers a Bloom filter implementation.

Note that a Bloom filter is not enough by itself. If the Bloom filter claims that a number is not in it, then the number is definitely not in it. But if it claims that the number is already in it, there's a chance that it's wrong. So you need a second, exact data structure to be sure, and you use the Bloom filter to reduce the number of lookups in that data structure.
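
A rough Python sketch of that combination, with assumptions labeled: `SmallBloom` and `ExactDeduper` are hypothetical names, and the in-memory `set` merely stands in for whatever exact (and presumably slower) structure is used in practice. This does not use Guava; it only illustrates the lookup-saving pattern.

```python
import hashlib
from typing import Iterable, Iterator


class SmallBloom:
    """Compact illustrative Bloom filter; a Python int serves as the bit set."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.m, self.k, self.bits = num_bits, num_hashes, 0

    def _mask(self, value: int) -> int:
        d = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        mask = 0
        for i in range(self.k):
            mask |= 1 << ((h1 + i * h2) % self.m)
        return mask

    def might_contain(self, value: int) -> bool:
        mask = self._mask(value)
        return self.bits & mask == mask

    def add(self, value: int) -> None:
        self.bits |= self._mask(value)


class ExactDeduper:
    """Bloom filter in front of an exact store (a set stands in for it here)."""

    def __init__(self, bloom: SmallBloom) -> None:
        self.bloom = bloom
        self.seen = set()   # assumption: placeholder for a slower exact structure
        self.lookups = 0    # how many times the exact store was consulted

    def dedupe(self, stream: Iterable[int]) -> Iterator[int]:
        for x in stream:
            if self.bloom.might_contain(x):
                # Only a "maybe present" answer needs the exact check.
                self.lookups += 1
                if x in self.seen:
                    continue
            # Either the filter said "definitely new" or the exact check did.
            self.bloom.add(x)
            self.seen.add(x)
            yield x
```

The output is exact regardless of the filter's false positives. The filter only determines how often the exact structure has to be queried: first-time values that the filter rejects outright never trigger a lookup.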
