Strange python's hashlib.md5 behavior, different hash each time

Question

I've run into some really strange behavior while trying to calculate the md5 hash of a string. The returned hash is always wrong (and different each time) if I pass a string that was the result of concatenation. The only way I've found to get the real hash is to pass a string that wasn't modified in any way after creation.

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> m = hashlib.md5() 
>>> a1 = "stack"
>>> a2 = "overflow"
>>> a3 = a1 + a2
>>> a4 = str(a1 + a2)
>>> m.update("stackoverflow")
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'  # actual hash
>>> m.update(a1 + a2)
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m.update(a3)
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m.update(a4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'acd4e14145d34dcb10af785badf8e73e'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'03c06ca09faa26166f1096db02272b11'
>>> a1 + a2 == a1 + a2
True
>>> a1 + a2 == a3
True
>>> a3 == a4
True

Am I missing something?

Answer 1

What you are missing is that hash.update() doesn't replace the hashed data. You are continually updating the hash object, so you are getting the hash of the concatenated strings. From the hashlib.hash.update() documentation:

Update the hash object with the string arg. Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b).

Bold emphasis mine.

So you are not getting the hash of a single 'stackoverflow' string, you are getting the hash first of 'stackoverflow', then of 'stackoverflowstackoverflow', then 'stackoverflowstackoverflowstackoverflow', etc., each time appending another 'stackoverflow' and creating a longer and longer string. None of those longer strings are equal to the original short string, so their hashes are not likely to be equal either.

Create a new object for new strings, instead:

>>> import hashlib
>>> m = hashlib.md5()
>>> m.update('stack' + 'overflow')
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
>>> m = hashlib.md5()   # **new** hash object
>>> m.update('stackoverflow')
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
>>> m = hashlib.md5()     # new object again
>>> m.update('stack')     # add the string in pieces, part 1
>>> m.update('overflow')  # and part 2
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'

You can readily produce your 'wrong' hashes by sending in concatenated data:

>>> m = hashlib.md5()
>>> m.update('stackoverflowstackoverflow')
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m = hashlib.md5()
>>> m.update('stackoverflowstackoverflowstackoverflow')
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m = hashlib.md5()
>>> m.update('stackoverflow' * 4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'
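
The same comparison can be written as a loop that covers all six update() calls from the question. This is only a sketch: the two function names are made up for illustration, and the b'...' bytes literal is my choice so the snippet runs on Python 2.7 as well as Python 3.

```python
import hashlib

piece = b'stackoverflow'  # bytes literal; accepted by both Python 2.7 and 3


def digests_after_repeated_updates(piece, count):
    """Hexdigest seen after each of `count` update() calls on ONE hash object."""
    m = hashlib.md5()
    seen = []
    for _ in range(count):
        m.update(piece)              # appends to the data hashed so far
        seen.append(m.hexdigest())   # hexdigest() does not reset the object
    return seen


def digests_of_repeated_data(piece, count):
    """Hexdigest of piece * n for n = 1..count, each from a fresh hash object."""
    return [hashlib.md5(piece * n).hexdigest() for n in range(1, count + 1)]


reused = digests_after_repeated_updates(piece, 6)
fresh = digests_of_repeated_data(piece, 6)
print(reused == fresh)  # True
```

Only the first entry in each list is the hash of plain 'stackoverflow'; every later entry is the hash of a longer string.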

Note that you can also pass the initial string directly to the md5() function:

>>> hashlib.md5('stackoverflow').hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
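
If you hash many independent strings, one way to avoid accidentally reusing a hash object is to wrap that one-liner in a small function. The helper below is hypothetical (md5_hex and its encoding parameter are names I made up). It also encodes text to bytes first, which is my addition: on Python 3, hashlib only accepts bytes-like data, while on Python 2.7 a plain str is already bytes and passes straight through.

```python
import hashlib


def md5_hex(text, encoding='utf-8'):
    """Hash one complete string with a brand-new md5 object (hypothetical helper)."""
    # Python 3's hashlib rejects str, so encode anything that is not bytes yet.
    # On Python 2.7 a plain str is bytes and is used unchanged.
    data = text if isinstance(text, bytes) else text.encode(encoding)
    return hashlib.md5(data).hexdigest()


a1 = 'stack'
a2 = 'overflow'
print(md5_hex(a1 + a2) == md5_hex('stackoverflow'))  # True
print(md5_hex(a1 + a2) == md5_hex(a1 + a2))          # True
```

Because every call builds its own hash object, earlier calls cannot affect later results.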

You normally use the hash.update() method only if you are processing data in chunks (like reading a file line by line or reading blocks of data from a socket), and don't want to have to hold all of that data in memory at once.
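
As a rough sketch of that chunked use case, here is one way to hash a file without reading it into memory in one piece. The function name md5_of_file, the 8192-byte default chunk size, and the temporary file in the usage part are all assumptions of mine for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Return the md5 hexdigest of a file, reading it in fixed-size chunks."""
    m = hashlib.md5()  # one object per file; update() accumulates the chunks
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            m.update(chunk)
    return m.hexdigest()


# Usage: write a small temporary file, then compare with hashing it in one go.
data = b'stackoverflow' * 1000
fd, tmp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    chunked = md5_of_file(tmp_path, chunk_size=64)
finally:
    os.remove(tmp_path)

print(chunked == hashlib.md5(data).hexdigest())  # True
```

Here the accumulating behavior of update() is exactly what you want: the chunks are meant to be treated as one concatenated stream.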

solution1
9 ACCEPTED 2017-04-29 15:21:43
