md5sum in a shell script and Python's hashlib.md5 give different results

Question

I am comparing two qcow2 image files in two different locations to see whether they differ: /opt/images/file.qcow2 and /mnt/images/file.qcow2

When I run

md5sum /opt/images/file.qcow2 
md5sum  /mnt/images/file.qcow2

both files have the same checksum.

But when I try to compute the md5sum with the following code

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(file1).hexdigest()
        md5File2 = hashlib.md5(file2).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

it says the checksums are different.

UPDATE: the files can be as large as 8 GB.

Answer 1

You are hashing the path of the file, not its contents...

hashlib.md5(file1).hexdigest() # file1 = '/path/to/file.ext'
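
To make the difference concrete, here is a minimal sketch that hashes a path string and then the bytes stored at that path. The temporary file and its contents are made up for illustration. Note that in Python 3 the path has to be encoded to bytes first, because hashlib.md5 rejects str objects.

```python
import hashlib
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example image data")
    path = tmp.name

try:
    # Digest of the path string itself (what the question's code computes).
    path_digest = hashlib.md5(path.encode("utf-8")).hexdigest()

    # Digest of the bytes stored in the file (what md5sum computes).
    with open(path, "rb") as f:
        content_digest = hashlib.md5(f.read()).hexdigest()
finally:
    os.remove(path)

print("digest of path:    ", path_digest)
print("digest of contents:", content_digest)
```

The two digests differ, and two copies of the same image in different directories will always have different path digests even when their contents are identical.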

Hash the contents instead:

import hashlib
import os

def md5(fname):
    """Return the MD5 hex digest of the file's contents."""
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 8 KiB chunks so a large file never sits fully in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5(file1)
        md5File2 = md5(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    # True only when both content digests match.
    return md5File1 == md5File2

Side note: you may want to use hashlib.sha1() (which matches Unix's sha1sum) instead of md5, which is broken and deprecated...
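
If you want to switch algorithms without rewriting the helper, one option is to pass the algorithm name to hashlib.new. This is only a sketch; the function name file_hexdigest and its default arguments are my own choices, not something from the question.

```python
import hashlib

def file_hexdigest(path, algorithm="sha1", chunk_size=16384):
    """Return the hex digest of a file's contents, read chunk by chunk."""
    hasher = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

With that in place, file_hexdigest(file1) == file_hexdigest(file2) compares the two images by content, and file_hexdigest(file1, "md5") should match what md5sum prints for the same file. On Python 3.11 and later, hashlib.file_digest covers much the same ground.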

Edit: a benchmark of md5 vs. sha1 with various buffer sizes, run on a server (Atom N2800 @ 1.86GHz) using a 100 MB random file:

┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Algorithm ┃  Buffer ┃    Time (s)   ┃
┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│    md5sum │     --- │ 0.387         │
│       MD5 │     2⁶  │ 21.5670549870 │
│       MD5 │     2⁸  │ 6.64844799042 │
│       MD5 │     2¹⁰ │ 3.12886619568 │
│       MD5 │     2¹² │ 1.82865810394 │
│       MD5 │     2¹⁴ │ 1.27349495888 │
│       MD5 │   128¹  │ 11.5235209465 │
│       MD5 │   128²  │ 1.27280807495 │
│       MD5 │   128³  │ 1.16839885712 │
│   sha1sum │    ---  │ 1.013         │
│      SHA1 │     2⁶  │ 23.4520659447 │
│      SHA1 │     2⁸  │ 7.75686216354 │
│      SHA1 │     2¹⁰ │ 3.82775402069 │
│      SHA1 │     2¹² │ 2.52755594254 │
│      SHA1 │     2¹⁴ │ 1.93437695503 │
│      SHA1 │   128¹  │ 12.9430441856 │
│      SHA1 │   128²  │ 1.93382811546 │
│      SHA1 │   128³  │ 1.81412386894 │
└───────────┴─────────┴───────────────┘

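So md5sum is faster than sha1sum, and Python's implementation shows the same pattern. A bigger buffer improves performance, but only up to a point (16384 seems a good trade-off: not too large and reasonably efficient).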
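
For anyone who wants to repeat the measurement on their own machine, a rough sketch of such a benchmark follows. It is not the script used for the table above; the helper names, the buffer sizes picked, and the 4 MiB test file are assumptions chosen to keep the run short.

```python
import hashlib
import os
import tempfile
import time

def hash_file(path, algorithm, buffer_size):
    """Hash a file's contents, reading buffer_size bytes at a time."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def benchmark(path, algorithms=("md5", "sha1"),
              buffer_sizes=(2**6, 2**10, 2**14, 128**3)):
    """Time each algorithm/buffer-size pair once; return {(algo, size): seconds}."""
    timings = {}
    for algorithm in algorithms:
        for size in buffer_sizes:
            start = time.perf_counter()
            hash_file(path, algorithm, size)
            timings[(algorithm, size)] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # 4 MiB of random data (an arbitrary size, much smaller than the 100 MB above).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        for (algorithm, size), seconds in sorted(benchmark(tmp.name).items()):
            print("{:>5} {:>8} {:.4f}s".format(algorithm, size, seconds))
    finally:
        os.remove(tmp.name)
```

Single runs like this are noisy, and the first pass may include disk reads that later passes get from the page cache, so treat the numbers as a rough guide only.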

Answer 2

Try this:

import os
from hashlib import md5

def md5File(filename):
    hasher = md5()
    # Read 16 MiB at a time so large files are never fully loaded into memory.
    blockSize = 16 * 1024 * 1024

    with open(filename, 'rb') as f:
        while True:
            fileBuffer = f.read(blockSize)
            if not fileBuffer:
                break

            hasher.update(fileBuffer)

    return hasher.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5File(file1)
        md5File2 = md5File(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    # Compare the two digests (not a digest against the md5File function).
    return md5File1 == md5File2

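When you just call hashlib.md5(file1).hexdigest(), you are really only taking the MD5 of the file name. What you actually want is the MD5 of the contents, which means opening and reading the file with Python's file operations. The method I posted above can hash large files without reading the whole content into memory.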
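
Chunked reading is safe because feeding a hash object piece by piece with update() gives the same digest as hashing all the bytes in one call. A small check, using made-up in-memory data as a stand-in for a real image file:

```python
import hashlib
import io

data = b"x" * 100000 + b"some trailing bytes"  # stand-in for file contents

# One-shot: the whole payload is in memory at once.
one_shot = hashlib.md5(data).hexdigest()

# Chunked: read a file-like object 4096 bytes at a time.
hasher = hashlib.md5()
stream = io.BytesIO(data)
while True:
    block = stream.read(4096)
    if not block:
        break
    hasher.update(block)
chunked = hasher.hexdigest()

print(one_shot == chunked)  # True
```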

Answer 3

How about using the following code:

import hashlib
import os

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        # Each file is read fully into memory, so this only suits small files.
        with open(file1, "rb") as f1:
            md5File1 = hashlib.md5(f1.read()).hexdigest()
        with open(file2, "rb") as f2:
            md5File2 = hashlib.md5(f2.read()).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    return md5File1 == md5File2

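Note that this works well for smaller files. If the files are large, it is better to read them chunk by chunk, as in the examples given above. For that case you can use the following code: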

import hashlib
import time

with open("Some_Very_Large_File", "rb") as f:
    hasher = hashlib.md5()
    a = time.time()
    while True:
        # Read 3 KiB at a time so the whole file never sits in memory.
        data = f.read(3 * 1024)
        if not data:
            break
        hasher.update(data)
    print(hasher.hexdigest())
    b = time.time()
    print("Done hashing in", b - a, "seconds")

Here are the benchmarks I observed:

3.26 GB media file: hash calculated in 11.26 sec.
4.8 GB file: hash calculated in 16.47 sec.
10.8 GB file: hash calculated in 102.36 sec.
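
Dividing size by time turns those figures into throughput, which makes the three runs easier to compare. A quick calculation using only the numbers quoted above:

```python
# (size in GB, seconds) pairs copied from the figures above.
runs = [(3.26, 11.26), (4.8, 16.47), (10.8, 102.36)]

throughputs = [size_gb / seconds for size_gb, seconds in runs]
for (size_gb, seconds), rate in zip(runs, throughputs):
    print("{:>5} GB in {:>6} s -> {:.2f} GB/s".format(size_gb, seconds, rate))
```

The first two runs come out at roughly 0.29 GB/s and the third at roughly 0.11 GB/s, so the largest file was hashed noticeably more slowly per gigabyte. The figures alone do not say why.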

Please try the code and let me know.


Answer 1
4 accepted 2016-07-04 09:48:43

Answer 2
1 2016-07-04 09:49:05

Answer 3
1 2016-07-04 10:06:31
