How to update hashlib.md5 hasher using existing hasher in python?

Question

I have a cached instance of a hasher:

m1 = hashlib.md5()
m1.update(b'very-very-long-data')
cached_sum = m1

and I would like to update an external hasher with the previously cached sum:

def append_cached_hash(external_hasher):
    # something like this
    external_hasher.update(cached_sum)

Unfortunately, this does not work, because update() expects bytes. I could pass the same b'very-very-long-data' bytes again, but that defeats the whole purpose of pre-caching the md5 sum for a common long data object.

I could do something like the following:

external_hasher.update(cached_sum.hexdigest())

However, it does not produce the same needed result as:

external_hasher.update(b'very-very-long-data')

How could I implement the function above?

The same problem can be formulated differently. There are 3 big data sets, and I need to calculate md5 sums in Python for all possible combinations of them. The md5 of each data source may be calculated only once.

m1 = hashlib.md5(b'very-big-data-1')
m2 = hashlib.md5(b'very-big-data-2')
m3 = hashlib.md5(b'very-big-data-3')

What should I write in the second parameter of the following print functions to achieve the goal?

print("sum for data 1 and data 2 is:", m1.update(m2))
print("sum for data 1 and data 3 is:", m1.update(m3))
print("sum for data 2 and data 3 is:", m2.update(m3))
print("sum for data 1, data 2 and data 3 is:", m1.update(m2.update(m3)))

Thanks in advance for your help!

Answer 1

A hash function is a one-way function that consumes a variable-length sequence of bytes and produces a fixed-length sequence, the hash. The hashlib implementation follows this model and doesn't provide a way to get the input sequence back out of a hash object, at least not an obvious one.
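
To illustrate why a cached digest cannot stand in for the original data, here is a minimal sketch. The b'prefix-' bytes are a made-up placeholder for whatever the external hasher has already consumed; only standard hashlib calls are used.

```python
import hashlib

data = b'very-very-long-data'
cached = hashlib.md5(data)

# Hypothetical external hasher that has already consumed some bytes.
via_digest = hashlib.md5(b'prefix-')
via_digest.update(cached.digest())  # feeds the 16-byte digest

via_data = hashlib.md5(b'prefix-')
via_data.update(data)               # feeds the original bytes

# The two hashers consumed different input, so their digests differ.
print(via_digest.hexdigest() == via_data.hexdigest())  # False
```

The digest is just another 16-byte input as far as update() is concerned, so it cannot replay the data it was computed from.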

IMO it also makes sense from the OOP perspective: such a hash object represents a hash, so it can be used in its place and passed around without unauthorized code being able to read the original input. I'm not sure hashlib objects are really that secure, though.

So to calculate all the combinations you need to keep the datasets available and use them directly. You can use the hash.copy method to reuse partial hashing results though, as advised in the docs:

hash.copy()

Return a copy (“clone”) of the hash object. This can be used to efficiently compute the digests of strings that share a common initial substring.

import hashlib

# bytes literals: hashlib requires bytes (not str) in Python 3
d1 = b'data-1'
d2 = b'data-2'
d3 = b'data-3'

h1 = hashlib.md5(d1)
# instead of hashlib.md5(d1).update(d2), or hashlib.md5(d1 + d2)
h12 = h1.copy()
h12.update(d2)
# instead of hashlib.md5(d1).update(d3), or hashlib.md5(d1 + d3)
h13 = h1.copy()
h13.update(d3)

h2 = hashlib.md5(d2)
# instead of hashlib.md5(d2).update(d1), or hashlib.md5(d2 + d1)
h21 = h2.copy()
h21.update(d1)

# ...
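
The pattern above can be generalized. The following is only a sketch under some assumptions: the function name combination_digests and the dataset labels are hypothetical, and "combination" is read as an ordered sequence, since the md5 of d1 + d2 differs from that of d2 + d1. Each shared prefix is hashed once and then extended through copy().

```python
import hashlib
from itertools import permutations

# Placeholder data; the labels '1', '2', '3' are made up for illustration.
datasets = {'1': b'data-1', '2': b'data-2', '3': b'data-3'}

def combination_digests(datasets):
    """Map each ordered tuple of labels to the md5 hexdigest of the
    corresponding concatenated data, reusing prefix hashers via copy()."""
    prefix = {(): hashlib.md5()}  # hasher state after each ordered prefix
    results = {}
    for length in range(1, len(datasets) + 1):
        for names in permutations(datasets, length):
            h = prefix[names[:-1]].copy()    # reuse the shorter prefix
            h.update(datasets[names[-1]])    # feed only the last dataset
            prefix[names] = h
            results[names] = h.hexdigest()
    return results

results = combination_digests(datasets)
print(results[('1', '2')] == hashlib.md5(b'data-1' + b'data-2').hexdigest())  # True
```

Note that this still reads each dataset more than once overall; copy() only saves re-hashing the data that forms a shared prefix.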

What about hashing a sum of the partial hashes, would that be of use to you?
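
One way to read that suggestion, as a sketch only: hash each data source once, keep the raw digests, and hash their concatenation. The helper name combined_sum is hypothetical, and the result is not equal to the md5 of the concatenated data; it is merely a reproducible fingerprint of the combination.

```python
import hashlib

# Each data source is hashed exactly once.
digest1 = hashlib.md5(b'very-big-data-1').digest()
digest2 = hashlib.md5(b'very-big-data-2').digest()
digest3 = hashlib.md5(b'very-big-data-3').digest()

def combined_sum(*digests):
    """Hash the concatenation of already computed digests (hypothetical helper)."""
    return hashlib.md5(b''.join(digests)).hexdigest()

print("sum for data 1 and data 2 is:", combined_sum(digest1, digest2))
print("sum for data 1, data 2 and data 3 is:", combined_sum(digest1, digest2, digest3))
```

This only helps if every party that compares the sums agrees to compute them the same way.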
