
Why is Python's hashlib.md5 faster than Linux coreutils md5sum?

I just found that Python's hashlib.md5 can be faster than coreutils md5sum.

python hashlib

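import hashlib
import os

def get_hash(fpath, algorithm='md5', block=32768):
    # Return '' for an unsupported algorithm or a missing file.
    if not hasattr(hashlib, algorithm):
        return ''
    m = getattr(hashlib, algorithm)()
    if not os.path.isfile(fpath):
        return ''
    # Open in binary mode so the digest covers the raw bytes.
    with open(fpath, 'rb') as f:
        while True:
            data = f.read(block)
            if not data:
                break
            m.update(data)
    return m.hexdigest()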

coreutils md5sum

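import os
from subprocess import Popen, PIPE

def shell_hash(fpath, method='md5sum'):
    if not os.path.isfile(fpath):
        return ''
    cmd = [method, fpath]  # argument list, so no shlex.split is needed
    p = Popen(cmd, stdout=PIPE)
    output, _ = p.communicate()
    # A non-zero exit status means the command failed.
    if p.returncode:
        return ''
    # Output has the form "<digest>  <filename>"; keep the digest.
    output = output.split()
    return output[0]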

My test results have 4 columns, giving the time taken to calculate md5 and sha1.

The 1st column is the time taken by coreutils md5sum or sha1sum.

The 2nd column is the time taken by python hashlib md5 or sha1, reading in chunks of 1048576 bytes.

The 3rd column is the time taken by python hashlib md5 or sha1, reading in chunks of 32768 bytes.

The 4th column is the time taken by python hashlib md5 or sha1, reading in chunks of 512 bytes.

4.08805298805 3.81827783585 3.72585606575 5.72505903244
6.28456497192 3.69725108147 3.59885907173 5.69266486168
4.08003306389 3.82310700417 3.74562311172 5.74706888199
6.25473690033 3.70099711418 3.60972714424 5.70108985901
4.07995700836 3.83335709572 3.74854302406 5.74988412857
6.26068210602 3.72050404549 3.60864400864 5.69080018997
4.08979201317 3.83872914314 3.75350999832 5.79242300987
6.28977203369 3.69586396217 3.60469412804 5.68853116035
4.0824379921 3.83340883255 3.74298214912 5.73846316338
6.27566385269 3.6986720562 3.6079480648 5.68188500404
4.10092496872 3.82357311249 3.73044300079 5.7778570652
6.25675201416 3.78636980057 3.62911510468 5.71392583847
4.09579920769 3.83730792999 3.73345088959 5.73320293427
6.26580905914 3.69428491592 3.61320495605 5.69155502319
4.09030103683 3.82516098022 3.73244214058 5.72749185562
6.26151800156 3.6951239109 3.60320997238 5.70400810242
4.07977604866 3.81951498985 3.73287010193 5.73037815094
6.26691818237 3.72077894211 3.60203289986 5.71795105934
4.08536100388 3.83897590637 3.73681998253 5.73614501953
6.2943251133 3.72131896019 3.61498594284 5.69963502884
(My computer has a 4-core i3-2120 CPU @ 3.30GHz and 4 GB of memory.
 The file hashed by these programs is about 2 GB in size.
 The odd rows are for md5 and the even rows are for sha1.
 All times in this table are in seconds.)

Across more than 100 test runs, python hashlib was consistently faster than md5sum or sha1sum, except for md5 read in 512-byte chunks (the 4th column).
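The question does not show how the timings were taken, so the following is only a minimal sketch of one way to time a single hashing pass at different chunk sizes. It is written for Python 3 (the question used Python 2.7), and the function name `time_hash` and the 8 MiB sample file are assumptions made for illustration, not part of the original test.

```python
import hashlib
import os
import tempfile
import time


def time_hash(fpath, algorithm="md5", block=32768):
    """Return (hexdigest, elapsed_seconds) for one pass over fpath."""
    m = hashlib.new(algorithm)
    start = time.perf_counter()
    with open(fpath, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for data in iter(lambda: f.read(block), b""):
            m.update(data)
    return m.hexdigest(), time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical 8 MiB sample file; the question used a file of about 2 GB.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for block in (1048576, 32768, 512):
            digest, elapsed = time_hash(tmp.name, "md5", block)
            print(block, digest, elapsed)
    finally:
        os.remove(tmp.name)
```

On repeated runs the file is likely to be served from the page cache, so the numbers mostly reflect hashing and per-call overhead rather than disk speed.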

I also read the source code in Python2.7/Modules/{md5.c,md5.h,md5module.c} and gnulib lib/{md5.c,md5.h}. Both are implementations of MD5 (RFC 1321).

In gnulib, md5 reads the file in chunks of 32768 bytes.

I don't know much about md5 or C source code. Could someone help me explain these results?

The other reason I am asking is that many people take it for granted that md5sum is faster than python hashlib, and they prefer to call md5sum when writing python code. That assumption seems to be wrong.

coreutils had its own C implementation, whereas python calls out to libcrypto, which has architecture-specific assembly implementations. The difference is even greater with sha1. This has now been fixed in coreutils-8.22 (when configured --with-openssl), and is enabled in newer distros such as Fedora 21, RHEL 7 and Arch.

Note that calling out to the command, even though it is currently slower on some systems, is a better long-term strategy, as one can take advantage of all the logic encapsulated within the separate commands rather than reimplementing it. For example, coreutils has pending support for improved reading of sparse files, so that zeros are not redundantly read from the kernel. It is better to take advantage of that transparently if possible.
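To see which implementation a given interpreter actually uses, something like the following Python 3 sketch may help. The helper name `describe_md5_backend` is hypothetical, and the constructor name it reports is a CPython implementation detail: on builds linked against OpenSSL it is typically `openssl_md5`, but that is an observation rather than a documented guarantee.

```python
import hashlib

try:
    import ssl
    # Version string of the OpenSSL library loaded by the interpreter.
    OPENSSL_VERSION = ssl.OPENSSL_VERSION
except ImportError:
    # Some builds ship without the ssl module.
    OPENSSL_VERSION = None


def describe_md5_backend():
    """Report which constructor hashlib.md5 resolves to on this build."""
    ctor = hashlib.md5
    return {
        # Name of the underlying constructor (implementation detail).
        "constructor": getattr(ctor, "__name__", repr(ctor)),
        "openssl": OPENSSL_VERSION,
    }


if __name__ == "__main__":
    print(describe_md5_backend())
```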

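I'm not sure how you timed this, but the difference may be due to the time it takes to spin up a subprocess each time you call shell_hash (also consider the parsing time of shlex.split).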
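One rough way to check how much of the gap subprocess startup could explain is to time the command on an empty file, where almost all of the cost is process creation. This is a Python 3 sketch under stated assumptions: the helper name `spawn_overhead` and the default of 20 runs are illustrative choices, not something from the question.

```python
import os
import shutil
import subprocess
import tempfile
import time


def spawn_overhead(cmd="md5sum", runs=20):
    """Average seconds to run cmd on an empty file, or None if cmd is missing."""
    exe = shutil.which(cmd)
    if exe is None:
        return None
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        pass  # empty file: hashing cost is negligible, leaving mostly startup cost
    try:
        start = time.perf_counter()
        for _ in range(runs):
            subprocess.run([exe, tmp.name], stdout=subprocess.DEVNULL, check=True)
        return (time.perf_counter() - start) / runs
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(spawn_overhead())
```

If the result is in the millisecond range, startup alone is unlikely to account for differences of several tenths of a second or more on a file of about 2 GB.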
