
A space-efficient data structure to store and look up a large set of (uniformly distributed) integers

I'm required to hold one million uniformly distributed integers in memory and look them up. My workload is extremely lookup-intensive.
My current implementation uses a HashSet (Java). I see good lookup performance, but the memory usage is not ideal (dozens of MB).
Could you think of a more efficient (memory) data structure?
Edit: The solution will need to support a small number of additions to the data structure.

Background:
The Integers problem stated above is a simplification of the following problem:
I have a set of one million Strings (my "Dictionary"), and I want to tell whether or not the Dictionary contains a given string.
The Dictionary is too large to fit in memory, so I'm willing to sacrifice a tiny bit of accuracy to reduce the memory footprint. I'll do that by switching to a Dictionary containing each String's hashCode value (an integer) instead of the actual characters. I'm assuming that the chance of a collision, per string, is only 1M/2^32 (about 0.023%).
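
In code, the idea is roughly the following (a minimal sketch; class and method names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the hashcode-based Dictionary described above: store
// each String's hashCode() instead of the String itself, trading a small
// false-positive probability (~1M / 2^32 per lookup) for memory.
class HashcodeDictionary {
    private final Set<Integer> hashes = new HashSet<>();

    void add(String word) {
        hashes.add(word.hashCode());
    }

    // May return a false positive when two strings share a hashCode.
    boolean mightContain(String word) {
        return hashes.contains(word.hashCode());
    }
}
```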

While Jon Skeet's answer gives good savings for a small investment, I think you can do better. Since your numbers are fairly evenly distributed, you can use an interpolation search for faster lookups (roughly O(log log N) instead of O(log N)). For a million items, you can probably plan on around 4 comparisons instead of around 20.
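
As a sketch, an interpolation search over a sorted int[] could look like this (names are illustrative; the long arithmetic guards against 32-bit overflow):

```java
// Interpolation search sketch: estimate the key's position from its value,
// which works well when the values are roughly uniformly distributed.
class InterpolationSearch {
    static boolean contains(int[] a, int key) {     // a must be sorted
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[lo] == a[hi]) {                   // avoid division by zero
                return a[lo] == key;
            }
            // Linear estimate of the position; long math avoids overflow.
            long num = ((long) key - a[lo]) * (hi - lo);
            long den = (long) a[hi] - a[lo];
            int mid = lo + (int) (num / den);
            if (a[mid] < key)      lo = mid + 1;
            else if (a[mid] > key) hi = mid - 1;
            else                   return true;
        }
        return false;
    }
}
```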

If you want to do just a little more work to cut the memory (roughly) in half again, you could build it as a two-level lookup table, basically a sort of simple version of a trie.


You'd break your (presumably) 32-bit integer into two 16-bit pieces. You'd use the first 16 bits as an index into the first level of the lookup table. At this level, you'd have 65536 pointers, one for each possible 16-bit value for that part of your integer. That would take you to the second level of the table. For this part, we'd do a binary or interpolation search between the chosen pointer and the next one up -- i.e., all the values in the second level that had that same value in the first 16 bits.

When we look in the second table, however, we already know 16 bits of the value -- so instead of storing all 32 bits, we only have to store the other 16 bits.

That means instead of the second level occupying 4 megabytes, we've reduced it to 2 megabytes. Along with that we need the first-level table, but it's only 65536 × 4 = 256 KB.

This will almost certainly improve speed over a binary search of the entire data set. In the worst case (using a binary search for the second level) we could have as many as 17 comparisons (1 + log2(65536) = 17). The average will be better than that though -- since we have only a million items, there can only be an average of 1,000,000 / 65536 ≈ 15 items in each second-level "partition", giving approximately 1 + log2(16) = 5 comparisons. Using an interpolation search at the second level might reduce that a little further, but when you're only starting with 5 comparisons, you don't have much room left for really dramatic improvements. Given an average of only ~15 items at the second level, the type of search you use won't make much difference -- even a linear search is going to be pretty fast.
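
A minimal sketch of that two-level table (all names illustrative; for simplicity it assumes the values are non-negative and the input array is sorted):

```java
import java.util.Arrays;

// First level: 65537 offsets indexed by the high 16 bits of a value.
// Second level: only the low 16 bits, stored as a char[] (unsigned 16-bit),
// sorted within each partition. For a million ints that is ~2 MB + ~256 KB.
class TwoLevelIntSet {
    private final int[] offsets = new int[65537];
    private final char[] low;

    TwoLevelIntSet(int[] sortedValues) {            // sorted, non-negative
        low = new char[sortedValues.length];
        int i = 0;
        for (int top = 0; top < 65536; top++) {
            offsets[top] = i;
            while (i < sortedValues.length && (sortedValues[i] >>> 16) == top) {
                low[i] = (char) (sortedValues[i] & 0xFFFF);
                i++;
            }
        }
        offsets[65536] = i;                          // end sentinel
    }

    boolean contains(int value) {
        int from = offsets[value >>> 16], to = offsets[(value >>> 16) + 1];
        // ~15 entries per partition on average, so even a linear scan would
        // be fast; binary search keeps the worst case small.
        return Arrays.binarySearch(low, from, to, (char) (value & 0xFFFF)) >= 0;
    }
}
```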

Of course, if you wanted to you could go a step further and use a 4-level table instead (one for each byte in the integer). It may be open to question, however, whether that would save you enough more to be worth the trouble. At least right off, my immediate guess is that you'd be doing a fair amount of extra work for fairly minimal savings (just storing the final bytes of the million integers obviously occupies 1 megabyte, and the three levels of table leading to that would clearly occupy a fair amount more), so you'd double the number of levels to save something like half a megabyte. If you're in a situation where saving just a little more would make a big difference, go for it -- but otherwise, I doubt the return will justify the extra investment.

Sounds like you could just keep a sorted int[] and then do a binary search. With a million values, that's ~20 comparisons to get to any value - would that be fast enough?
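
That's only a few lines (a minimal sketch; names are illustrative):

```java
import java.util.Arrays;

// Sorted-array approach: ~4 MB for a million ints, at most ~20 comparisons.
class SortedIntSet {
    private final int[] values;

    SortedIntSet(int[] input) {
        values = input.clone();
        Arrays.sort(values);                 // sort once up front
    }

    boolean contains(int key) {
        return Arrays.binarySearch(values, key) >= 0;
    }
}
```

The small number of additions mentioned in the edit could be handled with a small unsorted overflow buffer that is checked on each lookup and merged into the sorted array occasionally.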

If you are willing to accept a small chance of a false positive in return for a large reduction in memory usage, then a Bloom filter may be just what you need.

A Bloom filter consists of k hash functions and a table of n bits, initially all zero. To add an item to the table, feed it to each of the k hash functions (getting a number between 0 and n−1) and set the corresponding bit. To check whether an item is in the table, feed it to each of the k hash functions and see if all k corresponding bits are set.

A Bloom filter with a 1% false positive rate requires about 10 bits per item; the false positive rate decreases rapidly as you add more bits per item.
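
For illustration, a hand-rolled sketch of the idea (not production code; the double-hashing scheme and mixing constants are just one reasonable choice). For one million items at a ~1% false-positive rate you'd pick n ≈ 10,000,000 bits (~1.2 MB) and k ≈ 7:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for int items. The k bit indices are derived
// from two base hashes (the Kirsch–Mitzenmacher double-hashing trick).
class IntBloomFilter {
    private final BitSet bits;
    private final int n, k;

    IntBloomFilter(int n, int k) {          // e.g. n = 10_000_000, k = 7
        this.bits = new BitSet(n);
        this.n = n;
        this.k = k;
    }

    void add(int item) {
        int h1 = mix(item), h2 = mix(Integer.reverse(item)) | 1;
        for (int i = 0; i < k; i++)
            bits.set(Math.floorMod(h1 + i * h2, n));
    }

    // False positives possible, never false negatives.
    boolean mightContain(int item) {
        int h1 = mix(item), h2 = mix(Integer.reverse(item)) | 1;
        for (int i = 0; i < k; i++)
            if (!bits.get(Math.floorMod(h1 + i * h2, n)))
                return false;
        return true;
    }

    // Cheap integer mixer (multiplicative hash; the constant is arbitrary).
    private static int mix(int x) {
        x *= 0x9E3779B9;
        return x ^ (x >>> 16);
    }
}
```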

Here's an open-source implementation in Java.

The GitHub project LargeIntegerSet has some Java implementations of integer sets with reduced memory consumption.

You might want to look at a BitSet. The one used in Lucene is faster than the standard Java implementation because it skips some of the standard bounds checks.
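
Note that a bitmap needs one bit per possible value, so it only pays off when the value range is bounded -- covering the full 32-bit range would take 2^32 bits = 512 MB. A minimal sketch with the standard java.util.BitSet (names illustrative):

```java
import java.util.BitSet;

// One bit per possible value; only viable when values fit a bounded,
// non-negative range.
class BitSetIntSet {
    private final BitSet bits;

    BitSetIntSet(int maxValue) {            // values assumed in [0, maxValue)
        bits = new BitSet(maxValue);
    }

    void add(int value)         { bits.set(value); }
    boolean contains(int value) { return bits.get(value); }
}
```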

I think you might reconsider the original problem (having an efficient word list), rather than trying to optimize the optimization.

I would suggest looking into Radix tree/Trie.

https://en.wikipedia.org/wiki/Radix_tree or https://en.wikipedia.org/wiki/Trie

You are basically storing a tree of string prefixes, branching every time there is a choice in the dictionary. It has some interesting side effects (it allows filtering on prefixes very efficiently), can save some memory for strings with long common prefixes, and is reasonably fast.

Radix tree example
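
For illustration, a minimal (uncompressed) trie sketch -- a real radix tree would additionally merge chains of single-child nodes to save memory:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: one node per character, with a flag marking word ends.
class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Node());
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }
}
```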

Some example implementations:

https://lucene.apache.org/core/4_0_0/analyzers-stempel/org/egothor/stemmer/Trie.html

https://github.com/rkapsi/patricia-trie

https://github.com/npgall/concurrent-trees

There is an interesting comparison of various implementations here, with a bigger focus on performance than on memory usage, but it can still be helpful:

http://bhavin.directi.com/to-trie-or-not-to-trie-a-comparison-of-efficient-data-structures/

There are some IntHashSet implementations for primitives available.

Quick googling got me this one. There is also an Apache [open source] implementation of IntHashSet. I'd prefer the Apache implementation, though it has some overhead [it is implemented as an IntToIntMap].
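
For illustration, usage with one such primitive-collections library (Eclipse Collections here, as an assumption -- not necessarily the implementation linked above; fastutil and others offer similar classes):

```java
import org.eclipse.collections.impl.set.mutable.primitive.IntHashSet;

class PrimitiveSetExample {
    public static void main(String[] args) {
        IntHashSet set = new IntHashSet();
        set.add(42);                              // stored as a bare int, no boxing
        System.out.println(set.contains(42));     // true
    }
}
```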
