I want to store a large number of objects in a HashMap . The key to identify each object is a String which is always made up of 3 parts/substrings, for simplicity I name them A, B and C. A has high variability, B average variability and C low variability . There are multiple ways for combining the parts:
key = A + "_" + B + "_" + C;
key = A + "_" + C + "_" + B;
key = B + "_" + A + "_" + C;
...
Primarily, I would like to know how the key should be built from the substrings that have different variability/randomness in order to get the most uniform hash code distribution . Should the most random bits come first, or at the end, or...?
Secondly, I'd like to know how the length of the key influences the time to get the object from the HashMap. For example, if I double the key length does the object retrieval take twice the time? Or does the calculation of the hash code only take a fraction of that time because the process of getting the object from the HashMap's buckets takes much longer?
Bottom line: You should use the standard hashCode
method provided by the String
class... but NOT because the order doesn't matter.
(In fact, if you had said that C had highest variability and A had lowest, then the performance of the java.lang.String.hashCode
would be horrible!)
Take Away: Given addition information about the Object
's members, the order of hashing has a substantial affect on the distribution of keys.
Normally, without any domain-specific knowledge, it's best to opt for readability and the reliability of well-established libraries for things like this. However, since you have specific insight into the distribution of your substrings, you can make a more informed decision regarding your hashFunction.
To demonstrate, suppose part A can take on any character value, part B takes on only the first 15 characters, and part C takes on only the first 5 characters. and suppose you override the hashCode
method in the following way:
@Override
public int hashCode(){
final int constant = 37;
final String partA = getPartA(myString);
final String partB = getPartB(myString);
final String partC = getPartC(myString);
int total = 17;
total= total * constant + partA;
total= total * constant + partB;
total= total * constant + partC;
return total;
}
We would expect a near uniform random distribution of strings from this method. However, if we were to reverse the following lines:
total= total * constant + partC; //formerly part A
total= total * constant + partB;
total= total * constant + partA; //formerly part C
we would only generate codes in the first half of the value range. Here's some experimental results tested on 15,000 random strings that meet my stated assumptions above.
Are you making the key just for the sake of using it in a HashMap
? If so, then you don't even have to make it. You can put your object directly in a HashMap
, but you must override the methods hashCode()
and equals()
.
The good news is -- your IDE (eg Eclipse
) can generate suggested code for hashCode()
and equals()
for you. (In Eclipse, Source
> Generate hashCode() and equals() ...
). You can take its suggestion from there.
See my example code below.
I tend to think the computation is really fast. But if you have concerns about the speed, and if the three fields/parts/substrings are immutable, then you can compute the hashCode in the constructor, as I have done in my example code.
The speed of accessing elements from a hashmap depends on the load factor (ie how full is the hashmap). If the hashmap is lightly loaded (most buckets has zero or one elements in it), you get almost constant time O(1) for access. If the hashmap is heavily loaded (most buckets has many elements), then the performance would slow down significantly.
Example Code
package StringKeyForHashMap;
import java.util.HashMap;
import java.util.Map;
public class Thing {
private final String a;
private final String b;
private final String c;
private final int hashCode;
public Thing(String a, String b, String c) {
super();
this.a = a;
this.b = b;
this.c = c;
this.hashCode = computeHashCode();
}
@Override
public int hashCode() {
return this.hashCode;
}
private int computeHashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((a == null) ? 0 : a.hashCode());
result = prime * result + ((b == null) ? 0 : b.hashCode());
result = prime * result + ((c == null) ? 0 : c.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
Thing other = (Thing) obj;
if (a == null) {
if (other.a != null)
return false;
} else if (!a.equals(other.a))
return false;
if (b == null) {
if (other.b != null)
return false;
} else if (!b.equals(other.b))
return false;
if (c == null) {
if (other.c != null)
return false;
} else if (!c.equals(other.c))
return false;
return true;
}
public static void main(String[] args) {
/*
* Below I assume that the value of interest is
* an integer
*/
Map<Thing, Integer> map = new HashMap<>();
map.put(new Thing("AAA", "BBB", "CCC"), 0);
}
}
Whether a String
has high variability at the beginning of the string vs at the end of the string doesn't matter.
To test this, the below code simulates the hash-table logic of Java 8's HashMap
class. The methods tableSizeFor
and hash
were copied from the JDK source code.
The code will create 60 character strings that differ only by the first or last 7 characters. It will then build a hash-table with appropriate capacity and count the number of hash bucket collisions.
As can be seen in the output, the collision counts are the same (within statistical margins), regardless of leading or trailing variability of the strings being hashed.
Output
Count: 1000 Collisions: 384 By collision size: {1=240, 2=72}
Count: 1000 Collisions: 278 By collision size: {1=191, 2=30, 3=3, 4=3, 6=1}
Count: 100000 Collisions: 13876 By collision size: {1=12706, 2=579, 3=4}
Count: 100000 Collisions: 15742 By collision size: {1=12644, 2=1378, 3=110, 4=3}
Count: 10000000 Collisions: 2705759 By collision size: {1=1703714, 2=381705, 3=65050, 4=9417, 5=1038, 6=101, 7=3}
Count: 10000000 Collisions: 2626728 By collision size: {1=1698957, 2=365663, 3=56156, 4=6278, 5=535, 6=27, 7=4}
Test Code
public class Test {
public static void main(String[] args) throws Exception {
//
test(1000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
test(1000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
test(100000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
test(100000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
test(10000000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
test(10000000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
}
private static void test(int count, String format) {
// Allocate hash-table
final int initialCapacity = count * 4 / 3 + 1;
final int tableSize = tableSizeFor(initialCapacity);
int[] tab = new int[tableSize];
// Build strings, calculate hash bucket, and increment bucket counter
for (int i = 0; i < count; i++) {
String key = String.format(format, i);
int hash = hash(key);
int bucket = (tableSize - 1) & hash;
tab[bucket]++;
}
// Collect collision counts, i.e. counts > 1
// E.g. a bucket count of 3 means 1 original value plus 2 collisions
int total = 0;
Map<Integer, AtomicInteger> collisions = new TreeMap<>();
for (int i = 0; i < tableSize; i++)
if (tab[i] > 1) {
total += tab[i] - 1;
collisions.computeIfAbsent(tab[i] - 1, c -> new AtomicInteger()).incrementAndGet();
}
// Print result
System.out.printf("Count: %-8d Collisions: %-7d By collision size: %s%n", count, total, collisions);
}
static final int MAXIMUM_CAPACITY = 1 << 30;
static final int tableSizeFor(int cap) {
int n = cap - 1;
n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;
return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
}
The order doesn't effect the distribution of the hashkey. All characters have the same "weight".
The longer the key, the more time it takes to calculate hash, BUT String reuse the hashCode once it's created, therefore if you reuse the same String, hashCode is generated only once.
Having said that, I would suggest you change your implementation:
Even if you don't reuse the object, it's a better approach since it encapsulates the hash logic. But the real benefit comes if the object is reused.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.