
Why python hashlib.md5 is faster than linux coreutils md5sum

I just found that python hashlib.md5 might be faster than coreutils md5sum.

python hashlib

import hashlib
import os

def get_hash(fpath, algorithm='md5', block=32768):
    if not hasattr(hashlib, algorithm):
        return ''
    m = getattr(hashlib, algorithm)()
    if not os.path.isfile(fpath):
        return ''
    # Read in binary mode; text mode would corrupt the digest on
    # Python 3 and on platforms with newline translation.
    with open(fpath, 'rb') as f:
        while True:
            data = f.read(block)
            if not data:
                break
            m.update(data)
    return m.hexdigest()

coreutils md5sum

import os
from subprocess import Popen, PIPE

def shell_hash(fpath, method='md5sum'):
    if not os.path.isfile(fpath):
        return ''
    cmd = [method, fpath]  # passing a list avoids the need for shlex.split
    p = Popen(cmd, stdout=PIPE)
    output, _ = p.communicate()
    if p.returncode:
        return ''
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    return output.split()[0]
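For reference, a minimal timing harness along these lines (hypothetical; the question does not include the actual driver script) could produce the per-chunk-size timings. It writes a small temporary file rather than the ~2 GB file used in the original tests:

```python
import hashlib
import os
import tempfile
import timeit

def get_hash(fpath, algorithm='md5', block=32768):
    # Stream the file through hashlib in fixed-size chunks.
    m = hashlib.new(algorithm)
    with open(fpath, 'rb') as f:
        for chunk in iter(lambda: f.read(block), b''):
            m.update(chunk)
    return m.hexdigest()

# A small test file; the original benchmark used a ~2 GB file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of random data

try:
    for block in (1048576, 32768, 512):
        t = timeit.timeit(lambda: get_hash(path, 'md5', block), number=3)
        print('block=%d: %.4fs' % (block, t))
finally:
    os.remove(path)
```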

My test results below have 4 columns, giving the time to compute md5 and sha1.

The 1st column is the time of coreutils md5sum or sha1sum.

The 2nd column is the time of python hashlib md5 or sha1, reading 1048576-byte chunks.

The 3rd column is the time of python hashlib md5 or sha1, reading 32768-byte chunks.

The 4th column is the time of python hashlib md5 or sha1, reading 512-byte chunks.

4.08805298805 3.81827783585 3.72585606575 5.72505903244
6.28456497192 3.69725108147 3.59885907173 5.69266486168
4.08003306389 3.82310700417 3.74562311172 5.74706888199
6.25473690033 3.70099711418 3.60972714424 5.70108985901
4.07995700836 3.83335709572 3.74854302406 5.74988412857
6.26068210602 3.72050404549 3.60864400864 5.69080018997
4.08979201317 3.83872914314 3.75350999832 5.79242300987
6.28977203369 3.69586396217 3.60469412804 5.68853116035
4.0824379921 3.83340883255 3.74298214912 5.73846316338
6.27566385269 3.6986720562 3.6079480648 5.68188500404
4.10092496872 3.82357311249 3.73044300079 5.7778570652
6.25675201416 3.78636980057 3.62911510468 5.71392583847
4.09579920769 3.83730792999 3.73345088959 5.73320293427
6.26580905914 3.69428491592 3.61320495605 5.69155502319
4.09030103683 3.82516098022 3.73244214058 5.72749185562
6.26151800156 3.6951239109 3.60320997238 5.70400810242
4.07977604866 3.81951498985 3.73287010193 5.73037815094
6.26691818237 3.72077894211 3.60203289986 5.71795105934
4.08536100388 3.83897590637 3.73681998253 5.73614501953
6.2943251133 3.72131896019 3.61498594284 5.69963502884
(My computer has a 4-core i3-2120 CPU @ 3.30GHz and 4G of memory.
 The file hashed by these programs is about 2G in size.
 The odd rows are for md5 and the even rows are for sha1.
 The times in this table are in seconds.)

Over more than 100 test runs, I found python hashlib was always faster than md5sum or sha1sum.

I also read some of the source code in Python2.7/Modules/{md5.c,md5.h,md5module.c} and gnulib lib/{md5.c,md5.h}. Both are implementations of MD5 (RFC 1321).

In gnulib, md5 reads the file in 32768-byte chunks.

I don't know much about md5 or the C source code. Could someone help me explain these results?

The other reason I ask is that many people take it for granted that md5sum is faster than python hashlib, and so prefer to call md5sum when writing Python code. But that appears to be wrong.

coreutils has its own C implementation, whereas python calls out to libcrypto, which has architecture-specific assembly implementations. The difference is even greater with sha1. This has now been fixed in coreutils-8.22 (when configured --with-openssl), and is enabled in newer distros like Fedora 21, RHEL 7, Arch, etc.
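One way to check whether your Python's hashlib is in fact backed by OpenSSL (a sketch; `_hashlib` is a private CPython implementation detail, present when CPython was built against OpenSSL):

```python
import hashlib

try:
    import _hashlib  # CPython's OpenSSL (libcrypto) binding
    print('hashlib backed by OpenSSL:', hasattr(_hashlib, 'openssl_md5'))
except ImportError:
    print('hashlib is using the bundled C fallback implementations')

# Either backend must agree on the RFC 1321 test vector for "abc".
print(hashlib.md5(b'abc').hexdigest())
```

For coreutils, whether md5sum was built --with-openssl can typically be seen by checking if the binary links against libcrypto (e.g. with ldd).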

Note that calling out to the command, even though it is currently slower on some systems, is a better long-term strategy, as one can take advantage of all the logic encapsulated within the separate commands rather than reimplementing it. For example, coreutils has pending support for improved reading of sparse files, so that zeros are not redundantly read from the kernel; it is better to take advantage of that transparently if possible.

I'm not sure how you did your timing, but the difference could be due to the time spent spawning a subprocess for each call to shell_hash (also consider the parsing time of shlex.split).
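A rough way to measure that per-call overhead in isolation (a sketch; `true` does no work, so the measured time is almost entirely process startup and teardown):

```python
import timeit

# Cost of spawning one subprocess per call, independent of any hashing.
spawn = timeit.timeit(
    stmt="Popen(['true'], stdout=PIPE).communicate()",
    setup="from subprocess import Popen, PIPE",
    number=100,
)
print('100 subprocess spawns: %.3fs' % spawn)

# For comparison: the shlex parsing step alone is cheap but nonzero.
parse = timeit.timeit(
    stmt="shlex.split('md5sum /tmp/some_file')",
    setup="import shlex",
    number=100,
)
print('100 shlex.split calls: %.6fs' % parse)
```

On a multi-gigabyte file the hashing itself dominates, so this overhead alone is unlikely to explain the gap in the table above, but it matters when hashing many small files.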
