
Bloom filter to remove duplicates from a stream of integers in O(n)

How can I create a Bloom filter to remove duplicate elements from a stream of integers in O(n) time and O(1) space? I would appreciate it if someone could point me in the right direction.

I'm fairly certain it's just:

For each element:

  • Check whether it exists in the Bloom filter; if it does, it's likely a duplicate
  • Insert it into the Bloom filter
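As a rough illustration, the two steps above could look like the following Python sketch. The `BloomFilter` class, its `num_bits` and `num_hashes` defaults, and the SHA-256 double-hashing scheme are all assumptions made for this example, not something given in the question; any Bloom filter with an insert and a membership test would do.

```python
import hashlib


class BloomFilter:
    """Minimal illustrative Bloom filter (sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2.
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)


def dedupe(stream, bloom):
    """Yield each value the filter has not (probably) seen before."""
    for x in stream:
        if bloom.might_contain(x):
            continue  # likely a duplicate, but could be a false positive
        bloom.add(x)
        yield x
```

Since a Bloom filter never gives a false negative, the output never contains a repeated value. The price is that a false positive silently drops a value that was actually new.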

Now there are two problems with this:

  • There is a chance of false positives, so some elements that are not duplicates will be treated as duplicates
  • It's not truly O(1) space (though some people may say it is), as the size of the filter needs to depend on the number of (unique) elements; otherwise the error rate increases significantly as more elements are added.
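To put rough numbers on the second point: the usual approximation for the false-positive rate of a Bloom filter with m bits, k hash functions and n inserted elements is (1 - e^(-kn/m))^k, and the bit count needed for a target rate p is about -n ln(p) / (ln 2)^2. The function names below are made up for this sketch.

```python
import math


def false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def bits_needed(num_items, target_rate):
    """Approximate bits for a target rate: -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))


if __name__ == "__main__":
    # Fixed-size filter: the error rate climbs as more elements arrive.
    for n in (1_000, 10_000, 100_000):
        print(n, false_positive_rate(num_bits=100_000, num_hashes=7, num_items=n))
    # Fixed error rate: the required size grows linearly with n.
    for n in (1_000, 10_000, 100_000):
        print(n, bits_needed(n, target_rate=0.01))
```

Holding the size constant means the error rate tends toward 1 as the stream grows, and holding the error rate constant means the size grows linearly with the number of unique elements, which is why the space is not really O(1).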

I don't believe either of these problems can be avoided given the constraints - both are integral parts of using (only) bloom filters.

If we weren't dealing with a stream, but rather a list, we could get rid of the false positives by recording all the elements flagged by the Bloom filter and then going through the list a second time, checking against that candidate list to confirm which ones are actual duplicates. This is still O(n) time, but obviously not O(1) space.
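A minimal sketch of that two-pass idea, assuming the input fits in a list that can be read twice. The `BloomFilter` class and `dedupe_two_pass` are hypothetical names for this example. The first pass collects candidates, and the second pass does an exact check restricted to those candidates, so a false positive from the first pass no longer causes an element to be dropped.

```python
import hashlib


class BloomFilter:
    """Minimal illustrative Bloom filter (sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)


def dedupe_two_pass(items, bloom):
    # Pass 1: anything the filter flags is only a *candidate* duplicate.
    # Every real duplicate ends up here (no false negatives), possibly
    # along with some unique values (false positives).
    candidates = set()
    for x in items:
        if bloom.might_contain(x):
            candidates.add(x)
        else:
            bloom.add(x)

    # Pass 2: exact check, but only for values in the candidate set.
    seen_candidates = set()
    result = []
    for x in items:
        if x in candidates:
            if x in seen_candidates:
                continue  # confirmed duplicate
            seen_candidates.add(x)
        result.append(x)
    return result
```

The extra memory is proportional to the number of candidates (true duplicates plus false positives) rather than to all unique elements, which is the space trade-off mentioned above.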

Google Guava offers a bloom filter implementation.

Note that a Bloom filter is not enough by itself. If the Bloom filter claims that a number is not in it, then it's not in it. But if it claims that the number is already in it, there's a chance that it's wrong. So you need another data structure to be sure, and you use the Bloom filter to reduce the number of lookups in that data structure.
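One way this combination might look, as a hedged sketch: the Bloom filter sits in front of an exact store and the store is only consulted when the filter answers "maybe". `CountingStore` is a hypothetical stand-in for whatever slow exact structure is used (a database, a file on disk); it just counts lookups so the saving is visible. `BloomFilter` and `dedupe_exact` are likewise names invented for this example.

```python
import hashlib


class BloomFilter:
    """Minimal illustrative Bloom filter (sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)


class CountingStore:
    """Hypothetical stand-in for a slow, exact membership store."""

    def __init__(self):
        self._items = set()
        self.lookups = 0

    def __contains__(self, value):
        self.lookups += 1
        return value in self._items

    def add(self, value):
        self._items.add(value)


def dedupe_exact(stream, bloom, store):
    for x in stream:
        # "Definitely not seen" skips the expensive lookup entirely.
        if bloom.might_contain(x) and x in store:
            continue  # confirmed duplicate
        bloom.add(x)
        store.add(x)
        yield x
```

The result is exact, because the store has the final say. The filter only saves lookups for values it can rule out, and the store itself still grows with the number of unique elements.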
