
How to form String keys to get the most uniform hash code distribution

I want to store a large number of objects in a HashMap. The key that identifies each object is a String which is always made up of 3 parts/substrings; for simplicity I name them A, B and C. A has high variability, B average variability and C low variability. There are multiple ways of combining the parts:

key = A + "_" + B + "_" + C;
key = A + "_" + C + "_" + B;
key = B + "_" + A + "_" + C;
...

Primarily, I would like to know how the key should be built from substrings that have different variability/randomness in order to get the most uniform hash code distribution. Should the most random bits come first, or at the end, or...?

Secondly, I'd like to know how the length of the key influences the time to get the object from the HashMap. For example, if I double the key length does the object retrieval take twice the time? Or does the calculation of the hash code only take a fraction of that time because the process of getting the object from the HashMap's buckets takes much longer?

Bottom line: You should use the standard hashCode method provided by the String class... but NOT because the order doesn't matter.

(In fact, if you had said that C had highest variability and A had lowest, then the performance of the java.lang.String.hashCode would be horrible!)

Takeaway: Given additional information about the Object's members, the order of hashing has a substantial effect on the distribution of keys.

Normally, without any domain-specific knowledge, it's best to opt for readability and the reliability of well-established libraries for things like this. However, since you have specific insight into the distribution of your substrings, you can make a more informed decision about your hash function.

To demonstrate, suppose part A can take on any character value, part B takes on only the first 15 characters, and part C takes on only the first 5 characters, and suppose you override the hashCode method in the following way:

@Override
public int hashCode(){
    final int constant = 37;
    final String partA = getPartA(myString);
    final String partB = getPartB(myString);
    final String partC = getPartC(myString);
    int total = 17;
    // A String cannot be added to an int, so each part contributes its own hashCode()
    total = total * constant + partA.hashCode();
    total = total * constant + partB.hashCode();
    total = total * constant + partC.hashCode();

    return total;
}

We would expect a near uniform random distribution of hash codes from this method. However, if we were to reverse the order of the following lines:

    total = total * constant + partC.hashCode(); // formerly part A
    total = total * constant + partB.hashCode();
    total = total * constant + partA.hashCode(); // formerly part C

we would only generate codes in the first half of the value range. Here are some experimental results from a test on 15,000 random strings that meet the assumptions stated above.

[Figure: HashCode distribution when computed as A then B then C]

[Figure: HashCode distribution when computed as C then B then A]

Are you making the key just for the sake of using it in a HashMap? If so, then you don't even have to make it. You can use your object directly as the key in a HashMap, but you must override the methods hashCode() and equals().

The good news is that your IDE (e.g. Eclipse) can generate suggested code for hashCode() and equals() for you (in Eclipse, Source > Generate hashCode() and equals()...). You can take its suggestion from there.

See my example code below.

I tend to think the computation is really fast. But if you have concerns about the speed, and if the three fields/parts/substrings are immutable, then you can compute the hashCode in the constructor, as I have done in my example code.

The speed of accessing elements in a hashmap depends on the load factor (i.e. how full the hashmap is). If the hashmap is lightly loaded (most buckets have zero or one element), you get almost constant-time O(1) access. If the hashmap is heavily loaded (most buckets have many elements), then performance slows down significantly.
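To keep a java.util.HashMap lightly loaded from the start, it can be given an initial capacity derived from the expected number of entries. The sketch below assumes the standard HashMap(int initialCapacity, float loadFactor) constructor and the default load factor of 0.75, which is the fill ratio at which the table is resized. The class name, the helper capacityFor and the entry count are hypothetical and only serve the illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMapSketch {

    // Smallest initial capacity that can hold expectedEntries
    // without the map exceeding the given load factor.
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expectedEntries = 100_000;   // hypothetical workload size
        float loadFactor = 0.75f;        // HashMap's default load factor

        // Pre-sizing avoids repeated resizing while the map fills up.
        Map<String, Integer> map =
                new HashMap<>(capacityFor(expectedEntries, loadFactor), loadFactor);

        for (int i = 0; i < expectedEntries; i++) {
            map.put("A" + i + "_B_C", i);
        }
        System.out.println(map.size());
    }
}
```

A lower load factor trades memory for fewer entries per bucket; whether that is worth it depends on the workload.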

Example Code

package StringKeyForHashMap;

import java.util.HashMap;
import java.util.Map;

public class Thing {
    private final String    a;
    private final String    b;
    private final String    c;
    private final int       hashCode;


    public Thing(String a, String b, String c) {
        super();
        this.a = a;
        this.b = b;
        this.c = c;
        this.hashCode = computeHashCode();
    }


    @Override
    public int hashCode() {
        return this.hashCode;
    }

    private int computeHashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((a == null) ? 0 : a.hashCode());
        result = prime * result + ((b == null) ? 0 : b.hashCode());
        result = prime * result + ((c == null) ? 0 : c.hashCode());
        return result;
    }


    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Thing other = (Thing) obj;
        if (a == null) {
            if (other.a != null)
                return false;
        } else if (!a.equals(other.a))
            return false;
        if (b == null) {
            if (other.b != null)
                return false;
        } else if (!b.equals(other.b))
            return false;
        if (c == null) {
            if (other.c != null)
                return false;
        } else if (!c.equals(other.c))
            return false;
        return true;
    }


    public static void main(String[] args) {
        /*
         * Below I assume that the value of interest is 
         * an integer
         */
        Map<Thing, Integer> map = new HashMap<>();  
        map.put(new Thing("AAA", "BBB", "CCC"), 0);
    }

}

Whether a String has high variability at the beginning of the string vs at the end of the string doesn't matter.

To test this, the below code simulates the hash-table logic of Java 8's HashMap class. The methods tableSizeFor and hash were copied from the JDK source code.

The code will create 60 character strings that differ only by the first or last 7 characters. It will then build a hash-table with appropriate capacity and count the number of hash bucket collisions.

As can be seen in the output, the collision counts are the same (within statistical margins), regardless of leading or trailing variability of the strings being hashed.

Output

Count: 1000      Collisions: 384      By collision size: {1=240, 2=72}
Count: 1000      Collisions: 278      By collision size: {1=191, 2=30, 3=3, 4=3, 6=1}
Count: 100000    Collisions: 13876    By collision size: {1=12706, 2=579, 3=4}
Count: 100000    Collisions: 15742    By collision size: {1=12644, 2=1378, 3=110, 4=3}
Count: 10000000  Collisions: 2705759  By collision size: {1=1703714, 2=381705, 3=65050, 4=9417, 5=1038, 6=101, 7=3}
Count: 10000000  Collisions: 2626728  By collision size: {1=1698957, 2=365663, 3=56156, 4=6278, 5=535, 6=27, 7=4}

Test Code

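import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

public class Test {
    public static void main(String[] args) throws Exception {
        // For each count: first vary the last 7 characters, then the first 7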
        test(1000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
        test(1000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
        test(100000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
        test(100000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
        test(10000000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
        test(10000000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
    }
    private static void test(int count, String format) {
        // Allocate hash-table
        final int initialCapacity = count * 4 / 3 + 1;
        final int tableSize = tableSizeFor(initialCapacity);
        int[] tab = new int[tableSize];

        // Build strings, calculate hash bucket, and increment bucket counter
        for (int i = 0; i < count; i++) {
            String key = String.format(format, i);
            int hash = hash(key);
            int bucket = (tableSize - 1) & hash;
            tab[bucket]++;
        }

        // Collect collision counts, i.e. counts > 1
        // E.g. a bucket count of 3 means 1 original value plus 2 collisions
        int total = 0;
        Map<Integer, AtomicInteger> collisions = new TreeMap<>();
        for (int i = 0; i < tableSize; i++)
            if (tab[i] > 1) {
                total += tab[i] - 1;
                collisions.computeIfAbsent(tab[i] - 1, c -> new AtomicInteger()).incrementAndGet();
            }

        // Print result
        System.out.printf("Count: %-8d  Collisions: %-7d  By collision size: %s%n", count, total, collisions);
    }
    static final int MAXIMUM_CAPACITY = 1 << 30;
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
}

The order doesn't affect the distribution of the hash code. All characters have the same "weight".

The longer the key, the more time it takes to calculate the hash, BUT a String caches its hashCode once it has been computed. Therefore, if you reuse the same String instance, the hashCode is calculated only once.
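The linear cost follows from the formula documented for String.hashCode(): s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated in int arithmetic. As a minimal sketch, the loop below re-implements that formula with one multiply-add per character and compares the result with the built-in method. The class and method names are hypothetical.

```java
public class StringHashSketch {

    // Re-implementation of the documented String.hashCode() formula.
    // int arithmetic wraps around on overflow, as in the JDK.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // one multiply-add per character
        }
        return h;
    }

    public static void main(String[] args) {
        String key = "A_B_C";
        // Both values are computed from the same formula, so this prints true.
        System.out.println(polynomialHash(key) == key.hashCode());
    }
}
```

Doubling the key length doubles the work of this loop, but only for the first hashCode() call on a given String instance; later calls return the cached value.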

Having said that, I would suggest you change your implementation:

  1. Create an immutable class that accepts A, B, C in its constructor and calculates the hash value there.
  2. Make hashCode return the hash value from constructor.
  3. If possible, reuse the instances of the class, so you don't need to recalculate the hashcode each time the map is accessed.
  4. Don't forget to override equals.

Even if you don't reuse the object, it's a better approach since it encapsulates the hash logic. But the real benefit comes if the object is reused.
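The four steps can be sketched compactly with java.util.Objects. This is only one possible shape, not the poster's own code: the class name CompositeKey is hypothetical, and Objects.hash is used here as a convenient way to combine the three parts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public final class CompositeKey {
    private final String a;
    private final String b;
    private final String c;
    private final int hash;   // computed once, since all fields are immutable

    public CompositeKey(String a, String b, String c) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.hash = Objects.hash(a, b, c);   // step 1: hash in the constructor
    }

    @Override
    public int hashCode() {
        return hash;                          // step 2: return the stored value
    }

    @Override
    public boolean equals(Object obj) {       // step 4: equals consistent with hashCode
        if (this == obj) return true;
        if (!(obj instanceof CompositeKey)) return false;
        CompositeKey other = (CompositeKey) obj;
        return hash == other.hash
                && Objects.equals(a, other.a)
                && Objects.equals(b, other.b)
                && Objects.equals(c, other.c);
    }

    public static void main(String[] args) {
        Map<CompositeKey, Integer> map = new HashMap<>();
        CompositeKey key = new CompositeKey("AAA", "BBB", "CCC");
        map.put(key, 0);

        // Step 3: reusing the instance avoids building and hashing a new key.
        System.out.println(map.get(key));
        // An equal but distinct instance still finds the same entry.
        System.out.println(map.get(new CompositeKey("AAA", "BBB", "CCC")));
    }
}
```

Comparing the stored hash first in equals lets most unequal keys be rejected without any string comparison.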

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 