
Calculate a CRC / CRC32 hash / checksum on a binary file in Python using a buffer

I've been trying to teach myself Python, so I don't fully understand what I'm doing. I'm embarrassed to say this, but my question should be really easy to answer. I want to be able to compute CRC checksums on binary files with code similar to this:

# http://upload.wikimedia.org/wikipedia/commons/7/72/Pleiades_Spitzer_big.jpg

import zlib

buffersize = 65536

with open('Pleiades_Spitzer_big.jpg', 'rb') as afile:
    buffr = afile.read(buffersize)
    while len(buffr) > 0:
        crcvalue = zlib.crc32(buffr)
        buffr = afile.read(buffersize)

print(format(crcvalue & 0xFFFFFFFF, '08x'))

The correct result should be "a509ae4b", but my code's result is "dedf5161". I think the checksum is being calculated on either the first or last 64 KiB of the file instead of the whole file.

How should the code be altered so it checks the entire file without loading the entire file into memory?

As it is, the code "works" in either Python 2.x or 3.x. If the code has to be in one or the other, I'd prefer it to be in 3.x.

You're currently calculating the CRC of only the last chunk of the file, because each call to zlib.crc32 starts from scratch and its result overwrites crcvalue. To fix this, pass the current crcvalue to crc32 as the starting value:

import zlib

buffersize = 65536

with open('Pleiades_Spitzer_big.jpg', 'rb') as afile:
    buffr = afile.read(buffersize)
    crcvalue = 0
    while len(buffr) > 0:
        crcvalue = zlib.crc32(buffr, crcvalue)
        buffr = afile.read(buffersize)

print(format(crcvalue & 0xFFFFFFFF, '08x')) # a509ae4b

Here's the relevant part of the Python docs for zlib.crc32:

If value is present, it is used as the starting value of the checksum; otherwise, a default value of 0 is used. Passing in value allows computing a running checksum over the concatenation of several inputs.
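The running-checksum behaviour described in the docs can be checked on a small in-memory input. This is only an illustrative sketch: the input b"123456789" and the split point are arbitrary choices, not part of the original question.

```python
import zlib

data = b"123456789"  # small sample input, chosen only for illustration

# One-shot checksum over the whole input.
whole = zlib.crc32(data)

# Running checksum over two pieces, feeding the previous result back in
# as the starting value, as the fixed loop does.
running = 0
for piece in (data[:4], data[4:]):
    running = zlib.crc32(piece, running)

# Checksum of the final piece alone, which is what the question's loop
# ends up with because it never passes the previous value along.
last_only = zlib.crc32(data[4:])

print(format(whole & 0xFFFFFFFF, '08x'))    # cbf43926
print(format(running & 0xFFFFFFFF, '08x'))  # cbf43926
print(format(last_only & 0xFFFFFFFF, '08x'))
```

The first two lines printed are identical, while the third is the checksum of a different, shorter input, which is why the original code's result did not match the expected value.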

While the accepted answer by @niemmi is excellent and accurate, here is a solution for Python 3.8+ that simplifies the code a bit.


Python 3.8+

The sample below makes use of the walrus assignment operator ( := ) to keep track of the chunks being read:

import zlib

size = 1024*1024*10  # 10 MiB chunks
with open('/tmp/test.txt', 'rb') as f:
    crcval = 0
    while chunk := f.read(size):
        crcval = zlib.crc32(chunk, crcval)

print(f'{crcval & 0xFFFFFFFF:08x}')
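On interpreters older than 3.8, roughly the same loop shape is possible without the walrus operator by using the two-argument form of iter(). The sketch below wraps it in a function; the name file_crc32 and its default buffer size are hypothetical choices for illustration, not something from the answers above.

```python
import zlib
from functools import partial


def file_crc32(path, buffersize=65536):
    """Return the CRC32 of a file as an unsigned int, reading it in chunks.

    file_crc32 is a hypothetical helper name used for this sketch.
    """
    crcvalue = 0
    with open(path, 'rb') as afile:
        # iter(callable, sentinel) keeps calling afile.read(buffersize)
        # until it returns b'', which is what read() gives at end of file.
        for chunk in iter(partial(afile.read, buffersize), b''):
            crcvalue = zlib.crc32(chunk, crcvalue)
    # The mask keeps the result unsigned on Python 2 as well as Python 3.
    return crcvalue & 0xFFFFFFFF
```

Calling format(file_crc32('Pleiades_Spitzer_big.jpg'), '08x') on the image from the question should then give the same value as the accepted answer's loop, since only the iteration style differs.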

Testing

$ echo "Some boring example text in a file." > /tmp/test.txt

$ crc32 /tmp/test.txt
2a30366b

Checksum value using the example code above:

2a30366b
