
md5sum not matching python generated md5

I have a bizarre problem: the MD5 hash I compute in Python while streaming a file does not match the output of md5sum. The odd part is that if I read the file and write it out to a second file, the Python hash and md5sum second_file.txt agree. Here is the hashing code:

import hashlib 
import sys

file_hash = hashlib.md5()
with open(sys.argv[1], 'r') as f, open(sys.argv[2], 'w') as w:
    while True:
        c = f.read(1)
        w.write(c)
        file_hash.update(c.encode(encoding='utf-8'))

        if c == '':
            # end of file
            break

print(file_hash.hexdigest())

Both files are UTF-8, and everything runs in a Docker container. I'm at a loss here. Any ideas?

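Open the file in "rb" mode to get the raw bytes, and skip the encode step. Reading in text mode decodes the file and can translate line endings (for example, \r\n becomes \n), so re-encoding each character does not necessarily reproduce the bytes on disk, which are what md5sum hashes.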
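A minimal sketch of that suggestion, assuming the goal is to reproduce what md5sum prints for a file on disk. The helper name md5_of_file and the chunk size are illustrative choices, not part of the original post.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the raw bytes stored at `path`.

    `md5_of_file` and `chunk_size` are hypothetical names chosen for this sketch.
    """
    file_hash = hashlib.md5()
    # "rb" disables decoding and newline translation, so the hash is
    # computed over the same bytes that md5sum reads.
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest()
```

Calling print(md5_of_file(sys.argv[1])) after importing sys should then give the same digest as running md5sum on that file. Reading in chunks rather than one character at a time keeps memory use bounded without a call per byte.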

In general, the discrepancy could come from either the Python side or the md5sum side, so it would help if you posted the exact Linux command line that produces the different hash. In my experience, this most often happens when piping from "echo" and forgetting that "echo" appends a newline character to whatever it prints.

For example, these DO NOT match:

>> echo 'thing' | md5sum
>> python -c "import hashlib;print(hashlib.md5(b'thing').hexdigest())"

Use "printf" to prevent the newline from being added. These DO match:

>> printf 'thing' | md5sum
>> python -c "import hashlib;print(hashlib.md5(b'thing').hexdigest())"

You can also place the data in a file:

>> printf 'thing' > temp
>> cat temp | md5sum
>> python -c "import hashlib;print(hashlib.md5(b'thing').hexdigest())"
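The same effect can be reproduced entirely in Python. This is only an illustrative sketch: the variable names are made up for the example, and it simply hashes the two byte strings that "printf" and "echo" would send down the pipe.

```python
import hashlib

# Bytes produced by: printf 'thing'
without_newline = hashlib.md5(b"thing").hexdigest()
# Bytes produced by: echo 'thing' (echo appends a trailing newline)
with_newline = hashlib.md5(b"thing\n").hexdigest()

print(without_newline)
print(with_newline)
# A single extra byte is enough to change the digest.
print(without_newline == with_newline)  # prints False
```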
