
A space-efficient data structure to store and look up a large set of (uniformly distributed) integers

I'm required to hold one million uniformly distributed integers in memory and look them up. My workload is extremely lookup-intensive.
My current implementation uses a HashSet (Java). I see good lookup performance, but the memory usage is not ideal (dozens of MB).
Could you think of a more efficient (memory) data structure?
Edit: The solution will need to support a small number of additions to the data structure.

Background:
The Integers problem stated above is a simplification of the following problem:
I have a set of one million Strings (my "Dictionary"), and I want to tell whether or not the Dictionary contains a given string.
The Dictionary is too large to fit in memory, so I'm willing to sacrifice a tiny bit of accuracy to reduce the memory footprint. I'll do that by switching to a Dictionary containing each String's hashCode value (an integer) instead of the actual characters. I'm assuming that the chance of a collision, per string, is only 1M/2^32 (about 0.023%).
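
In code, the idea is roughly the following (a minimal sketch; class and method names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the hashcode-based Dictionary described above: store
// each String's hashCode() instead of the String itself, trading a small
// false-positive probability (~1M / 2^32 per lookup) for memory.
class HashcodeDictionary {
    private final Set<Integer> hashes = new HashSet<>();

    void add(String word) {
        hashes.add(word.hashCode());
    }

    // May return a false positive when two strings share a hashCode.
    boolean mightContain(String word) {
        return hashes.contains(word.hashCode());
    }
}
```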

While Jon Skeet's answer gives good savings for a small investment, I think you can do better. Since your numbers are fairly evenly distributed, you can use an interpolation search for faster lookups (roughly O(log log N) instead of O(log N)). For a million items, you can probably plan on around 4 comparisons instead of around 20.
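
As a sketch, an interpolation search over a sorted int[] could look like this (names are illustrative; the long arithmetic guards against 32-bit overflow):

```java
// Interpolation search sketch: estimate the key's position from its value,
// which works well when the values are roughly uniformly distributed.
class InterpolationSearch {
    static boolean contains(int[] a, int key) {     // a must be sorted
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[lo] == a[hi]) {                   // avoid division by zero
                return a[lo] == key;
            }
            // Linear estimate of the position; long math avoids overflow.
            long num = ((long) key - a[lo]) * (hi - lo);
            long den = (long) a[hi] - a[lo];
            int mid = lo + (int) (num / den);
            if (a[mid] < key)      lo = mid + 1;
            else if (a[mid] > key) hi = mid - 1;
            else                   return true;
        }
        return false;
    }
}
```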

If you want to do just a little more work to cut the memory (roughly) in half again, you could build it as a two-level lookup table, basically a sort of simple version of a trie.


You'd break your (presumably) 32-bit integer into two 16-bit pieces. You'd use the first 16 bits as an index into the first level of the lookup table. At this level, you'd have 65536 pointers, one for each possible 16-bit value for that part of your integer. That would take you to the second level of the table. For this part, we'd do a binary or interpolation search between the chosen pointer and the next one up -- i.e., all the values in the second level that had that same value in the first 16 bits.

When we look in the second table, however, we already know 16 bits of the value -- so instead of storing all 32 bits, we only have to store the other 16 bits.

That means instead of the second level occupying 4 megabytes, we've reduced it to 2 megabytes. Along with that we need the first-level table, but it's only 65536 × 4 = 256 KB.

This will almost certainly improve speed over a binary search of the entire data set. In the worst case (using a binary search for the second level) we could have as many as 17 comparisons (1 + log2(65536) = 17). The average will be better than that though -- since we have only a million items, there can only be an average of 1,000,000 / 65536 ≈ 15 items in each second-level "partition", giving approximately 1 + log2(16) = 5 comparisons. Using an interpolation search at the second level might reduce that a little further, but when you're only starting with 5 comparisons, you don't have much room left for really dramatic improvements. Given an average of only ~15 items at the second level, the type of search you use won't make much difference -- even a linear search is going to be pretty fast.
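
A minimal sketch of that two-level table (all names illustrative; for simplicity it assumes the values are non-negative and the input array is sorted):

```java
import java.util.Arrays;

// First level: 65537 offsets indexed by the high 16 bits of a value.
// Second level: only the low 16 bits, stored as a char[] (unsigned 16-bit),
// sorted within each partition. For a million ints that is ~2 MB + ~256 KB.
class TwoLevelIntSet {
    private final int[] offsets = new int[65537];
    private final char[] low;

    TwoLevelIntSet(int[] sortedValues) {            // sorted, non-negative
        low = new char[sortedValues.length];
        int i = 0;
        for (int top = 0; top < 65536; top++) {
            offsets[top] = i;
            while (i < sortedValues.length && (sortedValues[i] >>> 16) == top) {
                low[i] = (char) (sortedValues[i] & 0xFFFF);
                i++;
            }
        }
        offsets[65536] = i;                          // end sentinel
    }

    boolean contains(int value) {
        int from = offsets[value >>> 16], to = offsets[(value >>> 16) + 1];
        // ~15 entries per partition on average, so even a linear scan would
        // be fast; binary search keeps the worst case small.
        return Arrays.binarySearch(low, from, to, (char) (value & 0xFFFF)) >= 0;
    }
}
```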

Of course, if you wanted to you could go a step further and use a 4-level table instead (one for each byte in the integer). It may be open to question, however, whether that would save you enough more to be worth the trouble. At least right off, my immediate guess is that you'd be doing a fair amount of extra work for fairly minimal savings (just storing the final bytes of the million integers obviously occupies 1 megabyte, and the three levels of table leading to that would clearly occupy a fair amount more), so you'd double the number of levels to save something like half a megabyte. If you're in a situation where saving just a little more would make a big difference, go for it -- but otherwise, I doubt the return will justify the extra investment.

Sounds like you could just keep a sorted int[] and then do a binary search. With a million values, that's ~20 comparisons to get to any value - would that be fast enough?
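
That's only a few lines (a minimal sketch; names are illustrative):

```java
import java.util.Arrays;

// Sorted-array approach: ~4 MB for a million ints, at most ~20 comparisons.
class SortedIntSet {
    private final int[] values;

    SortedIntSet(int[] input) {
        values = input.clone();
        Arrays.sort(values);                 // sort once up front
    }

    boolean contains(int key) {
        return Arrays.binarySearch(values, key) >= 0;
    }
}
```

The small number of additions mentioned in the edit could be handled with a small unsorted overflow buffer that is checked on each lookup and merged into the sorted array occasionally.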

If you are willing to accept a small chance of a false positive in return for a large reduction in memory usage, then a Bloom filter may be just what you need.

A Bloom filter consists of k hash functions and a table of n bits, initially all zero. To add an item to the table, feed it to each of the k hash functions (getting a number between 0 and n−1) and set the corresponding bit. To check whether an item is in the table, feed it to each of the k hash functions and see if all k corresponding bits are set.

A Bloom filter with a 1% false positive rate requires about 10 bits per item; the false positive rate decreases rapidly as you add more bits per item.
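
For illustration, a hand-rolled sketch of the idea (not production code; the double-hashing scheme and mixing constants are just one reasonable choice). For one million items at a ~1% false-positive rate you'd pick n ≈ 10,000,000 bits (~1.2 MB) and k ≈ 7:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for int items. The k bit indices are derived
// from two base hashes (the Kirsch–Mitzenmacher double-hashing trick).
class IntBloomFilter {
    private final BitSet bits;
    private final int n, k;

    IntBloomFilter(int n, int k) {          // e.g. n = 10_000_000, k = 7
        this.bits = new BitSet(n);
        this.n = n;
        this.k = k;
    }

    void add(int item) {
        int h1 = mix(item), h2 = mix(Integer.reverse(item)) | 1;
        for (int i = 0; i < k; i++)
            bits.set(Math.floorMod(h1 + i * h2, n));
    }

    // False positives possible, never false negatives.
    boolean mightContain(int item) {
        int h1 = mix(item), h2 = mix(Integer.reverse(item)) | 1;
        for (int i = 0; i < k; i++)
            if (!bits.get(Math.floorMod(h1 + i * h2, n)))
                return false;
        return true;
    }

    // Cheap integer mixer (multiplicative hash; the constant is arbitrary).
    private static int mix(int x) {
        x *= 0x9E3779B9;
        return x ^ (x >>> 16);
    }
}
```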

Here's an open-source implementation in Java.

The GitHub project LargeIntegerSet has some Java implementations of integer sets with reduced memory consumption.

You might want to look at a BitSet. The one used in Lucene is faster than the standard Java implementation because it skips some of the standard bounds checks.
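
Note that a bitmap needs one bit per possible value, so it only pays off when the value range is bounded -- covering the full 32-bit range would take 2^32 bits = 512 MB. A minimal sketch with the standard java.util.BitSet (names illustrative):

```java
import java.util.BitSet;

// One bit per possible value; only viable when values fit a bounded,
// non-negative range.
class BitSetIntSet {
    private final BitSet bits;

    BitSetIntSet(int maxValue) {            // values assumed in [0, maxValue)
        bits = new BitSet(maxValue);
    }

    void add(int value)         { bits.set(value); }
    boolean contains(int value) { return bits.get(value); }
}
```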

I think you might reconsider the original problem (having an efficient word list), rather than trying to optimize the optimization.

I would suggest looking into Radix tree/Trie.

https://en.wikipedia.org/wiki/Radix_tree or https://en.wikipedia.org/wiki/Trie

You are basically storing a tree of string prefixes, branching every time there is a choice in the dictionary. It has some interesting side effects (it allows filtering on prefixes very efficiently), can save some memory for strings with long common prefixes, and is reasonably fast.

Radix tree example
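
For illustration, a minimal (uncompressed) trie sketch -- a real radix tree would additionally merge chains of single-child nodes to save memory:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: one node per character, with a flag marking word ends.
class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Node());
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }
}
```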

Some example implementations:

https://lucene.apache.org/core/4_0_0/analyzers-stempel/org/egothor/stemmer/Trie.html

https://github.com/rkapsi/patricia-trie

https://github.com/npgall/concurrent-trees

There is an interesting comparison of various implementations here, with a bigger focus on performance than on memory usage, but it can still be helpful:

http://bhavin.directi.com/to-trie-or-not-to-trie-a-comparison-of-efficient-data-structures/

There are some IntHashSet implementations for primitives available.

Quick googling got me this one. There is also an Apache [open source] implementation of IntHashSet. I'd prefer the Apache implementation, though it has some overhead [it is implemented as an IntToIntMap].
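
For illustration, usage with one such primitive-collections library (Eclipse Collections here, as an assumption -- not necessarily the implementation linked above; fastutil and others offer similar classes):

```java
import org.eclipse.collections.impl.set.mutable.primitive.IntHashSet;

class PrimitiveSetExample {
    public static void main(String[] args) {
        IntHashSet set = new IntHashSet();
        set.add(42);                              // stored as a bare int, no boxing
        System.out.println(set.contains(42));     // true
    }
}
```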
