Reusing hashlib.md5 calculates different values for identical strings

Question

This is my first test code:

    import hashlib
    md5Hash = hashlib.md5()
    md5Hash.update('Coconuts')
    print md5Hash.hexdigest()

    md5Hash.update('Apples')
    print md5Hash.hexdigest()

    md5Hash.update('Oranges')
    print md5Hash.hexdigest()

And this is my second chunk of code:

    import hashlib
    md5Hash = hashlib.md5()
    md5Hash.update('Coconuts')
    print md5Hash.hexdigest()

    md5Hash.update('Bananas')
    print md5Hash.hexdigest()

    md5Hash.update('Oranges')
    print md5Hash.hexdigest()

But the output of the first code is:

    0e8f7761bb8cd94c83e15ea7e720852a
    217f2e2059306ab14286d8808f687abb
    4ce7cfed2e8cb204baeba9c471d48f07

And the output of the second code is:

    0e8f7761bb8cd94c83e15ea7e720852a
    a82bf69bf25207f2846c015654ae68d1
    47dba619e1f3eaa8e8a01ab93c79781e

I changed the second string from 'Apples' to 'Bananas'; the third string is the same in both versions. Yet I get a different result for the third string. Hashing is supposed to give the same result every time. Am I missing something?

Answer 1

Because you call update() on the same md5Hash object for all three strings, each digest is the hash of every string fed in so far, concatenated together. Changing the second string therefore changes the result of the third print as well.

You need to create a separate md5 object for each string. Use a loop (Python 3 requires the b prefix so that bytes are passed; the same code also works in Python 2):

    import hashlib
    for s in (b'Coconuts', b'Bananas', b'Oranges'):
        md5Hash = hashlib.md5(s)  # no need for update, pass data at construction
        print(md5Hash.hexdigest())

Result:

    0e8f7761bb8cd94c83e15ea7e720852a
    1ee31b77d0697c36914b99d1428f7f32
    62f2b77089fea4c595e895901b63c10b

Note that the second and third values now differ from your output, but each one is the MD5 of a single string, computed independently.

Answer 2

hashlib.md5.update() adds data to the hash. It doesn't replace the existing values; if you want to hash a new value, you need to initialize a new hashlib.md5 object.

The values you're hashing are:

"Coconuts"               -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsApples"         -> 217f2e2059306ab14286d8808f687abb
"CoconutsApplesOranges"  -> 4ce7cfed2e8cb204baeba9c471d48f07

"Coconuts"               -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsBananas"        -> a82bf69bf25207f2846c015654ae68d1
"CoconutsBananasOranges" -> 47dba619e1f3eaa8e8a01ab93c79781e
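A minimal sketch to check this yourself, written for Python 3 (hence the bytes literals): feeding the strings through update() one at a time should give the same digest as hashing the concatenated string in one go. The variable names are only illustrative.

```python
import hashlib

# Feed the three strings into one hash object, one update() call each.
incremental = hashlib.md5()
for part in (b"Coconuts", b"Apples", b"Oranges"):
    incremental.update(part)

# Hash the concatenation in a single call on a fresh object.
one_shot = hashlib.md5(b"CoconutsApplesOranges")

# Both objects have consumed the same byte stream, so the digests match.
print(incremental.hexdigest() == one_shot.hexdigest())  # True
```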

Answer 3

Expected result

What you expect is generally what you would expect from common cryptographic libraries. In most cryptographic libraries the hash object is reset after calling a method that finalizes the calculation, such as hexdigest. It seems that hashlib.md5 behaves differently.

Result by `hashlib.md5`

MD5 requires the input to be padded with a 1 bit, zero or more 0 bits, and the length of the input in bits. Only then is the final hash value calculated. hashlib.md5 internally seems to perform this final calculation using separate variables, so the state it keeps after hashing each string does not include the final padding.

So each of your hashes is the hash of the earlier strings concatenated with the given string, followed by the correct padding, as duskwulf pointed out in his answer.

This is correctly documented by hashlib:

hash.digest()

Return the digest of the strings passed to the update() method so far. This is a string of digest_size bytes which may contain non-ASCII characters, including null bytes.

and

hash.hexdigest()

Like digest() except the digest is returned as a string of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.
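The "so far" wording can be observed directly. The following is a small sketch for Python 3 (where digest() returns bytes rather than a string): calling hexdigest() does not finalize or reset the object, so a repeated call returns the same value and a later update() simply continues from the accumulated input.

```python
import hashlib

h = hashlib.md5(b"Coconuts")

first = h.hexdigest()
second = h.hexdigest()   # no update() in between, so the value is unchanged

h.update(b"Apples")      # continues from "Coconuts"; it does not start over
third = h.hexdigest()

print(first == second)                                       # True
print(third == hashlib.md5(b"CoconutsApples").hexdigest())   # True
print(h.digest().hex() == third)                             # True
```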

Solution for `hashlib.md5`

As there doesn't seem to be a reset() method, you should create a new md5 object for each separate hash value you want to compute. Fortunately the hash objects themselves are relatively lightweight (even if the hashing itself isn't), so this won't consume much CPU or memory.
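One way to make this hard to get wrong is to wrap the construction in a small helper, so that no hash object outlives a single value. The sketch below is for Python 3; md5_hex is a hypothetical helper name, not part of hashlib. It also shows the copy() method of hash objects, which is useful in the opposite case, where you deliberately want several hashes that share a common prefix.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hash one value with a fresh md5 object (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()

# Independent hashes: the result for b"Oranges" no longer depends on
# whatever was hashed before it.
for s in (b"Coconuts", b"Apples", b"Oranges"):
    print(s, md5_hex(s))

# Shared prefix: copy() clones the current state, so the prefix is
# processed once and each copy continues on its own.
prefix = hashlib.md5(b"Coconuts")
with_apples = prefix.copy()
with_apples.update(b"Apples")
with_bananas = prefix.copy()
with_bananas.update(b"Bananas")

print(with_apples.hexdigest() == md5_hex(b"CoconutsApples"))    # True
print(with_bananas.hexdigest() == md5_hex(b"CoconutsBananas"))  # True
```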

Discussion of the differences

For hashing itself, resetting the hash in the finalizer may not make all that much sense. But it does matter for signature generation: you might want to initialize one signature instance and then generate multiple signatures with it. The hash function should then reset so that the signature can be calculated over multiple messages.

Sometimes an application requires an aggregated hash over multiple inputs, including intermediate hash results. In that case, however, a Merkle tree of hashes is normally used, where the intermediate hashes themselves are hashed again.

As indicated, I consider this bad API design by the authors of hashlib. For cryptographers it certainly doesn't follow the principle of least surprise.
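To illustrate the idea of hashing intermediate hashes again, here is a rough sketch of a binary hash tree in Python 3. The function names are hypothetical, and duplicating the last node on odd-sized levels is just one common convention, assumed here for simplicity. MD5 is used only to stay consistent with the question; it is not meant as a recommendation.

```python
import hashlib

def md5_digest(data: bytes) -> bytes:
    """Hash one value with a fresh md5 object (hypothetical helper)."""
    return hashlib.md5(data).digest()

def merkle_root(leaves):
    """Combine independent leaf hashes pairwise until one hash remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [md5_digest(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: duplicate the last node on odd levels.
            level.append(level[-1])
        # Each parent is the hash of its two children's digests.
        level = [md5_digest(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"Coconuts", b"Apples", b"Oranges"])
print(root.hex())
```

Every leaf is still hashed independently, so the hash of "Oranges" is the same in every tree; only the combined root changes when another leaf changes.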
