
C# Dictionary Memory Management

I have a Dictionary<string,int> that has the potential to contain upwards of 10 million unique keys. I am trying to reduce the amount of memory this takes while still maintaining the functionality of the dictionary.

I had the idea of storing a hash of the string as a long instead. This decreases the app's memory usage to an acceptable amount (from ~1.5 GB to ~0.5 GB), but I don't feel very good about my method for doing this:

// enc is a System.Text.Encoding instance; cryptoTransformSHA1 is a SHA1 hash object
long longKey = BitConverter.ToInt64(
    cryptoTransformSHA1.ComputeHash(enc.GetBytes(strKey)), 0);

Basically this discards all but the first 8 bytes of the SHA1 hash and reinterprets those bytes as a long, which I then use as a key. While this works, at least for the data I'm testing with, I don't feel like this is a very reliable solution due to the increased possibility of key collisions.

Are there any other ways of reducing the Dictionary's memory footprint, or is the method I have above not as horrible as I think it is?

[edit] To clarify, I need to maintain the ability to look up a value contained in the Dictionary using a string. Storing the actual string in the dictionary takes way too much memory. What I would like to do instead is use a Dictionary<long,int> where the long is the result of a hashing function on the string.
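For concreteness, here is a minimal sketch of that shape, wrapping the SHA1-truncation key from the snippet above in a small class (the class name, UTF-8 encoding, and SHA1.Create() usage are illustrative assumptions, not code from the question):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class HashedKeyDictionary
{
    private readonly Dictionary<long, int> map = new Dictionary<long, int>();
    private readonly SHA1 sha1 = SHA1.Create();

    // First 8 bytes of the 160-bit SHA1 digest, reinterpreted as a long.
    private long Key(string s)
    {
        return BitConverter.ToInt64(sha1.ComputeHash(Encoding.UTF8.GetBytes(s)), 0);
    }

    public void Add(string s, int value) { map[Key(s)] = value; }

    public bool TryGetValue(string s, out int value)
    {
        return map.TryGetValue(s, out value);
    }
}

Lookups then go through TryGetValue(strKey, out value), with the collision caveat discussed above.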

So I have done something similar recently and, for a set of reasons fairly unique to my application, did not use a database. In fact, I was trying to stop using a database. I have found that GetHashCode is significantly improved in 3.5. One important note: NEVER PERSIST THE RESULTS OF GetHashCode. NEVER EVER. They are not guaranteed to be consistent between versions of the framework.

So you really need to conduct an analysis of your data, since different hash functions might work better or worse on it. You also need to account for speed. As a general rule, cryptographic hash functions should not have many collisions, even as the number of hashes moves into the billions. For things that I need to be unique I typically use SHA1Managed. In general the CryptoAPI has terrible performance, even if the underlying hash functions perform well.

For a 64-bit hash I currently use Lookup3 and FNV1, which are both 32-bit hashes, together. For a collision to occur, both would need to collide, which is mathematically improbable and which I have not seen happen over roughly 100 million hashes. You can find code for both publicly available on the web.
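As a rough sketch of the two-32-bit-hashes idea: FNV-1 below follows the published algorithm, but since the answer's Lookup3 port isn't shown, a simple djb2 hash stands in as the second function here; substitute a real Lookup3 implementation to match the answer.

using System;
using System.Text;

static class Hash64
{
    // FNV-1, 32-bit: multiply by the FNV prime, then XOR in the next byte.
    static uint Fnv1(byte[] data)
    {
        const uint prime = 16777619;
        uint hash = 2166136261;   // FNV offset basis
        foreach (byte b in data)
            hash = unchecked(hash * prime) ^ b;
        return hash;
    }

    // djb2, standing in for Lookup3 in this sketch.
    static uint Djb2(byte[] data)
    {
        uint hash = 5381;
        foreach (byte b in data)
            hash = unchecked(hash * 33 + b);
        return hash;
    }

    // Both 32-bit halves must collide at once for the 64-bit key to collide.
    public static long Key(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return unchecked((long)(((ulong)Fnv1(bytes) << 32) | Djb2(bytes)));
    }
}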

Still, conduct your own analysis. What has worked for me may not work for you. Even within my office, different applications with different requirements use different hash functions or combinations of hash functions.

I would avoid any unproven hash functions. There are as many hash functions as there are people who think they should be writing them. Do your research and test, test, test.

With 10 million-odd records, have you considered using a database with a non-clustered index? Databases have a lot more tricks up their sleeve for this type of thing.

Hashing, by definition and under any algorithm, has the potential for collisions, especially at high volumes. Depending on the scenario, I'd be very cautious about this.

Using the strings might take space, but it is reliable... if you are on x64 this needn't be too large (although it definitely counts as "big" ;-p)

By the way, cryptographic hashes / hash functions are exceptionally bad for dictionaries. They're big and slow. By solving the one problem (size) you've only introduced another, more severe problem: the function won't spread the input evenly any longer, thus destroying the single most important property of a good hash for approaching collision-free addressing (as you seem to have noticed yourself).

/EDIT: As Andrew has noted, GetHashCode is the solution for this problem, since that's its intended use. And as in a true dictionary, you will have to work around collisions. One of the best schemes for that is double hashing. Unfortunately, the only 100% reliable way is to actually store the original values. Otherwise, you'd have created infinite compression, which we know can't exist.
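For reference, a sketch of the double-hashing probe just mentioned: a second hash determines the probe stride, so keys that collide on the first hash still follow different probe sequences. The function below is illustrative only (the name and signature are made up, not from any library):

// Double hashing: slot_i = (h1 + i * step(h2)) mod m.
// If m is prime, every stride in 1..m-1 is coprime to m,
// so the probe sequence eventually visits every slot.
static int Slot(uint h1, uint h2, int i, int m)
{
    ulong step = (h2 % (ulong)(m - 1)) + 1;   // force stride into 1..m-1, never 0
    return (int)((h1 + (ulong)i * step) % (ulong)m);
}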

Why don't you use GetHashCode() to get the hash of the string?

With hashtable implementations I have worked with in the past, the hash brings you to a bucket, which is often a linked list of other objects that have the same hash. Hashes are not unique, but they are good enough to split your data up into very manageable lists (sometimes only 2 or 3 entries long) that you can then search through to find your actual item.

The key to a good hash is not its uniqueness, but its speed and distribution capabilities... you want it to distribute as evenly as possible.
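A minimal sketch of the bucket-and-chain layout described above (the class and its fixed sizing are illustrative assumptions, not the actual BCL implementation):

using System.Collections.Generic;

// Separate chaining: the hash selects a bucket; a short list inside the
// bucket is then scanned with real equality to find the exact item.
class ChainedTable<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

    public ChainedTable(int size)
    {
        buckets = new List<KeyValuePair<TKey, TValue>>[size];
    }

    private List<KeyValuePair<TKey, TValue>> Bucket(TKey key)
    {
        int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[i] == null)
            buckets[i] = new List<KeyValuePair<TKey, TValue>>();
        return buckets[i];
    }

    public void Add(TKey key, TValue value)
    {
        Bucket(key).Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        // Colliding keys share a bucket; the equality check picks the right one.
        foreach (KeyValuePair<TKey, TValue> kv in Bucket(key))
        {
            if (EqualityComparer<TKey>.Default.Equals(kv.Key, key))
            {
                value = kv.Value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}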

Just go get SQLite. You're not likely to beat it, and even if you do, it probably won't be worth the time/effort/complexity.

SQLite.
