How to improve the speed of merkle root calculation?

My implementation in Python calculates the merkle root hash for ~1500 input hashes:

import numpy as np
from binascii import unhexlify, hexlify
from hashlib import sha256

txids = np.loadtxt("txids.txt", dtype=str)

def double_sha256(a, b):
    inp = unhexlify(a)[::-1] + unhexlify(b)[::-1]
    sha1 = sha256(inp).digest()
    sha2 = sha256(sha1).digest()
    return hexlify(sha2[::-1])


def calculate_merkle_root(inp_list):
    if len(inp_list) == 1:
        return inp_list[0]
    out_list = []
    for i in range(0, len(inp_list)-1, 2):
        out_list.append(double_sha256(inp_list[i], inp_list[i+1]))
    if len(inp_list) % 2 == 1:
        out_list.append(double_sha256(inp_list[-1], inp_list[-1]))
    return calculate_merkle_root(out_list)

for i in range(1000):
    merkle_root_hash = calculate_merkle_root(txids)

print(merkle_root_hash)

Since the merkle root is calculated 1000 times, one calculation takes ~5 ms:

$ time python3 test.py 
b'289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e'

real    0m5.132s
user    0m5.501s
sys     0m0.133s

How could I improve the speed of the calculation? Can this code be optimized?

So far, I have tried to unroll the recursive function in both Python and C++. However, the performance did not improve; it took ~6 ms.
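
(For illustration, an unrolled version in Python could look like the following sketch. It keeps the same pairing logic as the recursive function above; it is not necessarily identical to the exact code that was benchmarked.)

def calculate_merkle_root_unrolled(inp_list):
    # Same pairing rules as the recursive version, expressed as a loop.
    level = list(inp_list)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate the last hash on odd levels
        level = [double_sha256(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]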

EDIT

The file is available here: txids.txt

EDIT 2

Following a suggestion in a comment, I removed the unnecessary unhexlify and hexlify steps. The list is now prepared once, before the loop.

def double_sha256(a, b):
    inp = a + b
    sha1 = sha256(inp).digest()
    sha2 = sha256(sha1).digest()
    return sha2

def map_func(t):
    return unhexlify(t)[::-1]
txids = list(map(map_func, txids))

for i in range(1000):
    merkle_root_hash = calculate_merkle_root(txids)
    merkle_root_hash = hexlify(merkle_root_hash[::-1])

Now one calculation takes ~4 ms:

$ time python3 test2.py 
b'289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e'

real    0m3.697s
user    0m4.069s
sys     0m0.128s

I decided to implement SHA-256 entirely from scratch using SIMD instruction sets (read about them here: SSE2, AVX2, AVX512).

As a result, my code below for the AVX2 case is 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.

I also created a related second post about the C++ version; see it here. Read the C++ post for more details about my library; this Python post is more high-level.

First, the timings:

simple 3.006
openssl 1.426
simd gen 1 1.639
simd gen 2 1.903
simd gen 4 0.847
simd gen 8 0.457
simd sse2 1 0.729
simd sse2 2 0.703
simd sse2 4 0.718
simd sse2 8 0.776
simd avx2 1 0.461
simd avx2 2 0.41
simd avx2 4 0.549
simd avx2 8 0.521

Here simple is the hashlib version, close to the one you provided, openssl stands for the OpenSSL version, and the remaining simd entries are my SIMD (SSE2/AVX2/AVX512) implementations. As you can see, the AVX2 version is 3.5x faster than the OpenSSL version and 7.3x faster than native Python's hashlib.

The timings above were done in Google Colab, as it has fairly advanced AVX2-capable CPUs available.

The code of the library is provided at the bottom. As the code is very large, it is posted as separate links, because it doesn't fit into StackOverflow's 30 KB limit. There are two files, sha256_simd.py and sha256_simd.hpp. The Python file contains timings and usage examples, as well as a Cython-based wrapper to use my C++ library shipped in the .hpp file. This Python file contains everything needed to compile and run the code; just place both files next to each other and run the Python file.

I tested this program/library on both Windows (MSVC compiler) and Linux (Clang compiler).

Examples of usage of my library are located in the merkle_root_simd_example() and main() functions. Basically, you do the following (a short usage sketch follows after the list):

  1. First, import my library through mod = sha256_simd_import(cap = 'avx2'). Do this only once per program run, not multiple times; keep the returned module in some global variable. In the cap parameter you should put whatever your CPU supports: gen, sse2, avx2 or avx512, in order of increasing technology complexity and speed. gen means generic non-SIMD operations, sse2 uses 128-bit operations, avx2 256-bit operations, and avx512 512-bit operations.

  2. After importing, use the imported module, for example like mod.merkle_root_simd('avx2', 2, txs). Here you again pass one of the gen/sse2/avx2/avx512 technologies. Why again? The first time, when importing, you give a compilation option that tells the compiler to support the given technology and all lower ones. Here you choose the SIMD technology that will actually be used for the merkle-root call; it can be lower (but not higher) than the compilation technology. For example, if you compiled for avx2, you can use the library with gen, sse2 or avx2, but not with avx512.

  3. You can see in 2) that I used the options ('avx2', 2, txs); here 2 is a parallelization parameter. It is not multi-core but single-core parallelization, meaning that two avx2 registers will be computed in a row. You should put 1, 2, 4 or 8, whichever gives the faster computation for you.
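
A minimal usage sketch following the three steps above (the names sha256_simd_import, merkle_root_simd and the cap/technology/parallelization parameters are the ones described in this answer; the [::-1] reversals follow the question's byte-order convention and may need adjusting):

from binascii import unhexlify, hexlify

# 1. compile/import the module once per program run and keep it around
mod = sha256_simd_import(cap = 'avx2')

# 2./3. the library expects raw bytes, so convert the hex txids first,
#       then compute the merkle root with the avx2 code path and a
#       single-core parallelization factor of 2
txs = [unhexlify(t)[::-1] for t in txids]
root = mod.merkle_root_simd('avx2', 2, txs)

# all functions return raw bytes, so convert back to hex if needed
print(hexlify(root[::-1]))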

In order for the library to be used, you have to have two things installed. One is a compiler (MSVC for Windows, Clang or GCC for Linux); the second is the Cython module, installed once through python -m pip install cython. Cython is an advanced library for programming C++ code inside Python; here it acts as a thin wrapper between my Python .py and C++ .hpp modules. Also, my code is written using the most modern C++20 standard, so be aware that you need an up-to-date C++ compiler to build it; for that, download the latest MSVC on Windows and/or the latest Clang for Linux (through the command bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)", which is described here).

In the .py file you can see that I sometimes provide the extra params has_ossl = True, win_ossl_dir = 'd:/bin/openssl/'; these two params are needed only if you want the OpenSSL version to be compiled into my library. Windows OpenSSL can be downloaded from here. The OpenSSL version can then be used through mod.merkle_root_ossl(txs), providing just a single parameter with the transactions.
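
A similarly hedged sketch of the OpenSSL path (parameter and function names as described above; the win_ossl_dir value is just the example directory from this answer and is only relevant on Windows):

# import with OpenSSL support compiled in (directory is illustrative)
mod = sha256_simd_import(cap = 'avx2', has_ossl = True, win_ossl_dir = 'd:/bin/openssl/')

# txs is again a list of raw-byte transactions; the result is raw bytes
root = mod.merkle_root_ossl(txs)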

All functions in my .py module expect a list of bytes for the transactions, meaning that if you have hex transactions you have to unhexlify them first. Also, all functions return a bytes hash, meaning you have to hexlify it if needed. This bytes-only transfer there and back is purely for performance reasons.

I understand that my code is quite complex to understand and use. So if you are serious about wanting the fastest code, please ask me questions about how to use and understand it. I should also say that my code is quite dirty; I didn't mean to make a clean, shiny library for everyone to use, I just wanted a proof of concept showing that the SIMD version is considerably faster than hashlib's version and even the OpenSSL version. Of course, this only holds if your CPU is advanced enough to support at least one of SSE2/AVX2/AVX512; most CPUs support SSE2, but not all support AVX2, let alone AVX512.

sha256_simd.py

sha256_simd.hpp

With the last update (2 May 2021 at 17:00), the calls to sha256(value).digest() take roughly 80% of the time on my machine. There are a few possible solutions to fix that.
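
As an aside, a measurement of this kind can be reproduced with a generic cProfile run (a sketch; not necessarily the exact tool used for the figure above):

import cProfile
# profile a reduced number of iterations and sort by cumulative time;
# the sha256-related entries should dominate the report
cProfile.run("for _ in range(100): calculate_merkle_root(txids)", sort="cumulative")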

The first is to parallelize the computation using multiprocessing, assuming the work is independent for each iteration. Here is an example:

from multiprocessing.pool import Pool

# [...] same as in the question

def iteration(txids):
    merkle_root_hash = calculate_merkle_root(txids)
    merkle_root_hash = hexlify(merkle_root_hash[::-1])
    return merkle_root_hash

processPool = Pool()
res = processPool.map(iteration, [txids for i in range(1000)])

print(res[-1])

This is 4 times faster on my 6-core machine.

Another solution is to find a faster Python module that can compute multiple sha256 hashes at the same time, to reduce the expensive C calls from the CPython interpreter. I am not aware of any package doing this.

Finally, one efficient solution is to (at least partially) rewrite the expensive calculate_merkle_root computation in C or C++ and run it in parallel. This should be significantly faster than your current code, as it removes the function call overhead and the multiprocessing cost. There are many libraries to compute a sha256 hash (like the Crypto++ library).
