
Multiplication should be suboptimal. Why is it used in hashCode?

Hash functions are incredibly useful and versatile. In general, they map a large space onto a much smaller one. That means two objects may hash to the same value (a collision), which is unavoidable when you reduce the space (the pigeonhole principle). The efficiency of the function largely depends on the size of the hash space.

It comes as a surprise, then, that a lot of Java hashCode functions use multiplication to produce the hash code of an object, for example as follows (creating-a-hashcode-method-java):

@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + (int) (id ^ (id >>> 32));
    result = prime * result + ((name == null) ? 0 : name.hashCode());
    return result;
}

If we want to mix two hash codes in the same range, XOR should be much better than addition and is, I think, traditionally used. If we wanted to increase the space, shifting by some number of bits and then XORing would still, in my opinion, make sense. I guess multiplying by 31 is almost the same as shifting the hash left by 5 bits and then subtracting the original value, but it should be much less efficient...
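To make the two schemes being compared concrete, here is a minimal sketch. The class and method names (`CombineDemo`, `xorCombine`, `mulCombine`) are hypothetical and exist only for illustration. One property it shows is that plain XOR ignores the order of its inputs and cancels equal inputs to zero, whereas the multiply-and-add form does not.

```java
final class CombineDemo {
    // Hypothetical helper: combine two hash codes with plain XOR.
    static int xorCombine(int a, int b) {
        return a ^ b;
    }

    // Hypothetical helper: combine two hash codes in the multiply-and-add
    // style of the hashCode method shown above.
    static int mulCombine(int a, int b) {
        int result = 1;
        result = 31 * result + a;
        result = 31 * result + b;
        return result;
    }

    public static void main(String[] args) {
        int first = "alice".hashCode();
        int second = "bob".hashCode();

        // XOR is commutative, so swapping the two fields gives the same hash.
        System.out.println(xorCombine(first, second) == xorCombine(second, first)); // true

        // XOR of two equal hash codes is always 0.
        System.out.println(xorCombine(first, first)); // 0

        // Multiply-and-add weights the fields differently, so order matters here.
        System.out.println(mulCombine(first, second) == mulCombine(second, first)); // false
    }
}
```

Whether this matters in practice depends on the data: it is only a problem for XOR when objects with swapped or equal field values are common.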

Since it is the recommended approach, though, I think I am missing something. So my question is: why is this?

Notes:

  • I am not asking why we use a prime. It is pretty clear that if we use multiplication, we should go with a prime. However, multiplying by any number, even a prime, should still be inferior to XOR. That is why, for example, all these other non-cryptographic hash functions, as well as most cryptographic ones, use XOR and not multiplication...
  • I have no actual evidence (apart from all those well-known hash functions) that XOR would be better. In fact, just because it is so widely accepted, I suspect that multiplying by a prime and summing is at least as good and, in practice, better. I am asking why this is...
  • The int type in Java can represent any whole number from -2147483648 to 2147483647.
  • Sometimes the hash code of an object may be its memory address (which makes sense and is efficient in a lot of situations), for example if hashCode is inherited from Object.

The answer to this is a mixture of different factors:

  • On modern architectures, the difference in time between a multiplication and a shift may not be measurable overall within a given pipeline of instructions; it has more to do with the availability of the relevant execution unit on the CPU than with the "raw" time taken;
  • In practice when integrating with standard collections libraries in day-to-day programming, it's often more important that a hash function is correct, "good enough" and easy to automate in an IDE than for it to be as perfect as possible;
  • The collections libraries generally add secondary hash functions and potentially other techniques behind the scenes to overcome some of the weaknesses of what would otherwise be a poor hash function;
  • With resizable collections, an effective hash function aims to disperse its hashes across the available range for arbitrary hash table sizes (though, as noted above, it gets help from the built-in secondary function). Multiplying by a "magic" constant is often a cheap way to achieve this; even if multiplication turned out to be a bit more expensive than a shift, it would still be cheap enough given the benefit. Using addition rather than XOR may slightly help this "avalanche" effect, since addition carries into higher bits. (In most practical cases, you will probably find that they work equally well.)
  • You can generally assume that the JIT compiler "knows" about equivalents such as shifting left by 5 places and subtracting the original value rather than multiplying by 31. Just because you write "*31" in the source code doesn't mean that it will literally be compiled to a multiplication instruction. (In practice, it might be, though, because despite what you might think, the multiply instruction may well be "faster" on average on the architecture in question... It's usually better to make your code stick to the required logic and let the JIT compiler handle the low-level optimisations in a case such as this.)
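Two of the points above can be sketched in a few lines of Java. The class and method names (`Times31Demo`, `byMultiply`, `byShift`, `spread`) are hypothetical. The `spread` method is modelled on the secondary hash used by `java.util.HashMap` in OpenJDK 8 and later, which XORs the high 16 bits into the low 16; treat it as an illustration of the idea rather than a copy of the library code.

```java
final class Times31Demo {
    // Multiplication as it is written in source code.
    static int byMultiply(int h) {
        return 31 * h;
    }

    // The shift-and-subtract equivalent: 32*h - h. Both forms wrap around
    // modulo 2^32 in the same way, so they agree for every int value.
    static int byShift(int h) {
        return (h << 5) - h;
    }

    // Secondary hash in the style of OpenJDK 8+ HashMap (an assumption about
    // the collection in use): fold the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            System.out.println(byMultiply(h) == byShift(h)); // true for each sample
        }

        // Two hashes that differ only in their high bits share bucket 0 in a
        // 16-bucket table, but land in different buckets after spreading.
        int a = 0x10000;
        int b = 0x20000;
        System.out.println((a & 15) + " " + (b & 15));                 // 0 0
        System.out.println((spread(a) & 15) + " " + (spread(b) & 15)); // 1 2
    }
}
```

The first loop only shows that the two forms compute the same value; which machine instructions the JIT actually emits is up to the compiler and the target CPU.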


 