
Reusing hashlib.md5 calculates different values for identical strings

This is my first test code:

   import hashlib
   md5Hash = hashlib.md5()
   md5Hash.update('Coconuts')
   print md5Hash.hexdigest()

   md5Hash.update('Apples')
   print md5Hash.hexdigest()

   md5Hash.update('Oranges')
   print md5Hash.hexdigest()

And this is my second chunk of code:

    import hashlib
    md5Hash = hashlib.md5()
    md5Hash.update('Coconuts')
    print md5Hash.hexdigest()

    md5Hash.update('Bananas')
    print md5Hash.hexdigest()

    md5Hash.update('Oranges')
    print md5Hash.hexdigest()

But the output of the first code is:

    0e8f7761bb8cd94c83e15ea7e720852a
    217f2e2059306ab14286d8808f687abb
    4ce7cfed2e8cb204baeba9c471d48f07

And the output of the second code is:

   0e8f7761bb8cd94c83e15ea7e720852a
   a82bf69bf25207f2846c015654ae68d1
   47dba619e1f3eaa8e8a01ab93c79781e

I replaced the second string, 'Apples', with 'Bananas'; the third string remains the same. But I still get a different result for the third string. Hashing is supposed to give the same result every time. Am I missing something?

Because you're using the update method, the same md5Hash object is reused for all 3 strings. So each digest is basically the hash of all the strings fed in so far, concatenated together. Changing the second string therefore changes the outcome of the 3rd print as well.

You need to create a separate md5 object for each string. Use a loop (Python 3 compliant code needs the bytes prefix, by the way, and the version below also works in Python 2):

import hashlib
for s in (b'Coconuts',b'Bananas',b'Oranges'):
    md5Hash = hashlib.md5(s)  # no need for update, pass data at construction
    print(md5Hash.hexdigest())

Result:

0e8f7761bb8cd94c83e15ea7e720852a
1ee31b77d0697c36914b99d1428f7f32
62f2b77089fea4c595e895901b63c10b

Note that the values are now different from yours, but at least each one is the MD5 of a single string, computed independently.

hashlib.md5.update() adds data to the hash. It doesn't replace the existing values; if you want to hash a new value, you need to initialize a new hashlib.md5 object.
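The accumulating behaviour of update() can be checked directly. The following is a minimal Python 3 sketch; the helper names running_digests and prefix_digests are made up for illustration and are not part of hashlib. It compares the digests obtained from one reused object with the digests of the growing concatenations, each hashed with a fresh object.

```python
import hashlib


def running_digests(chunks):
    """Feed chunks to ONE md5 object; record the hex digest after each update()."""
    h = hashlib.md5()
    digests = []
    for chunk in chunks:
        h.update(chunk)                # appends to everything fed so far
        digests.append(h.hexdigest())  # digest of all data seen up to now
    return digests


def prefix_digests(chunks):
    """Hash each growing concatenation with a FRESH md5 object."""
    return [
        hashlib.md5(b"".join(chunks[: i + 1])).hexdigest()
        for i in range(len(chunks))
    ]


chunks = [b"Coconuts", b"Apples", b"Oranges"]
for chunk, digest in zip(chunks, running_digests(chunks)):
    print(chunk, "->", digest)

# The two approaches agree, so update() is hashing the concatenation.
print(running_digests(chunks) == prefix_digests(chunks))  # True
```

Swapping b"Apples" for b"Bananas" changes the second and third entries of both lists in the same way, which is exactly the effect described in the question.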

The values you're hashing are:

"Coconuts"               -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsApples"         -> 217f2e2059306ab14286d8808f687abb
"CoconutsApplesOranges"  -> 4ce7cfed2e8cb204baeba9c471d48f07

"Coconuts"               -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsBananas"        -> a82bf69bf25207f2846c015654ae68d1
"CoconutsBananasOranges" -> 47dba619e1f3eaa8e8a01ab93c79781e

Expected result

What you are expecting is generally what you should expect from common cryptographic libraries. In most cryptographic libraries the hash object is reset after calling a method that finalizes the calculation, such as hexdigest. It seems that hashlib.md5 behaves differently.

Result by hashlib.md5

MD5 requires the input to be padded with a 1 bit, zero or more 0 bits, and the length of the input in bits. Then the final hash value is calculated. hashlib.md5 internally seems to perform this final calculation using separate variables, so the state kept after hashing each string does not include the final padding.

So each of your results is the hash of the earlier strings concatenated with the given string, followed by the correct padding, as duskwulf pointed out in his answer.
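To make the padding rule concrete, here is a small Python 3 sketch. The function name md5_padding is hypothetical (hashlib does not expose anything like it); it only reconstructs the bytes that the MD5 specification appends, and the second half checks, from the outside, that calling hexdigest() leaves the running state untouched. How hashlib achieves this internally is not shown here.

```python
import hashlib
import struct


def md5_padding(message_len):
    """Return the bytes MD5 appends to a message of message_len bytes.

    Illustrative helper, not a hashlib API.
    """
    # 0x80 is the single 1 bit followed by seven 0 bits. Zero bytes follow
    # until the length is 56 mod 64, then the message length in bits is
    # appended as a 64-bit little-endian integer.
    zero_count = (55 - message_len) % 64
    bit_length = (message_len * 8) & 0xFFFFFFFFFFFFFFFF
    return b"\x80" + b"\x00" * zero_count + struct.pack("<Q", bit_length)


# The padded message always fills a whole number of 64-byte blocks.
message = b"Coconuts"
padded = message + md5_padding(len(message))
print(len(padded) % 64)  # 0

# Asking for a digest does not finalize or reset the object:
h = hashlib.md5(b"Coconuts")
first = h.hexdigest()
second = h.hexdigest()   # same value, the state was not disturbed
h.update(b"Apples")      # continues from the un-padded state
assert first == second
assert h.hexdigest() == hashlib.md5(b"CoconutsApples").hexdigest()
```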

This behavior is correctly documented by hashlib:

hash.digest()

Return the digest of the strings passed to the update() method so far. This is a string of digest_size bytes which may contain non-ASCII characters, including null bytes.

and

hash.hexdigest()

Like digest() except the digest is returned as a string of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.
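The relationship between the two methods can be seen in a short Python 3 sketch. Note that the quoted wording comes from the Python 2 documentation; in Python 3, digest() returns a bytes object and hexdigest() returns a str.

```python
import hashlib

h = hashlib.md5(b"Coconuts")

raw = h.digest()       # raw bytes, may contain any byte value
text = h.hexdigest()   # the same digest as hexadecimal text

print(len(raw) == h.digest_size)  # True (digest_size is 16 for MD5)
print(len(text))                  # 32, twice the raw length
print(raw.hex() == text)          # True
```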

Solution for hashlib.md5

As there doesn't seem to be a reset() method, you should create a new md5 object for each separate hash value you want to compute. Fortunately the hash objects themselves are relatively lightweight (even if the hashing itself isn't), so this won't consume much CPU or memory.
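One way to follow this advice is sketched below in Python 3. The helper md5_hex is a hypothetical convenience wrapper, not something hashlib provides. The second part uses the copy() method that hashlib hash objects do provide, which can stand in for a "reset to a checkpoint" when several inputs share a common prefix.

```python
import hashlib


def md5_hex(data):
    """Hash one value with a fresh md5 object (illustrative helper)."""
    return hashlib.md5(data).hexdigest()


# Independent hashes: one new object per value.
for s in (b"Coconuts", b"Bananas", b"Oranges"):
    print(s, "->", md5_hex(s))

# Shared prefix: hash the prefix once, then branch with copy().
base = hashlib.md5(b"Coconuts")

with_apples = base.copy()
with_apples.update(b"Apples")

with_bananas = base.copy()
with_bananas.update(b"Bananas")

# Each copy continues independently; base itself is unchanged.
assert with_apples.hexdigest() == md5_hex(b"CoconutsApples")
assert with_bananas.hexdigest() == md5_hex(b"CoconutsBananas")
assert base.hexdigest() == md5_hex(b"Coconuts")
```

Calling md5_hex on the same bytes always returns the same digest, which is the behaviour the question expected.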

Discussion of the differences

For hashing itself, resetting the hash in the finalizer may not make all that much sense. But it does matter for signature generation: you might want to initialize a signature instance once and then generate multiple signatures with it. The hash function should reset so that the instance can calculate signatures over multiple messages.

Sometimes an application requires an aggregate hash over multiple inputs, including intermediate hash results. In that case, however, a Merkle tree of hashes is used, where the intermediate hashes themselves are hashed again.
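As a rough illustration of the Merkle tree idea, here is a minimal Python 3 sketch. The function name merkle_root and the rule for an odd number of nodes (the last node is duplicated) are assumptions made for this example; real protocols define their own pairing and encoding rules. MD5 is used only to stay with the topic of this question.

```python
import hashlib


def merkle_root(leaves):
    """Return the hex root hash of a simple Merkle tree over byte strings.

    Illustrative only: the odd-node rule below is an assumption.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")

    # Leaf level: hash every input independently with a fresh object.
    level = [hashlib.md5(leaf).digest() for leaf in leaves]

    # Repeatedly hash pairs of intermediate hashes until one remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [
            hashlib.md5(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


print(merkle_root([b"Coconuts", b"Apples", b"Oranges"]))
```

Unlike the reused md5 object in the question, every intermediate value here is a finished digest that is fed into a new hash object.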

As indicated, I consider this bad API design by the authors of hashlib. For cryptographers it certainly doesn't follow the rule of least surprise.
