How to improve the speed of merkle root calculation in C++?

Question

I am trying to optimise the merkle root calculation as much as possible. So far, I implemented it in Python which resulted in this question and the suggestion to rewrite it in C++.

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>

#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>



std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
    unsigned char inp[64];
    int j=0;
    for (int i=0; i<32; i++)
    {
        inp[j] = a[i];
        j++;
    }
    for (int i=0; i<32; i++)
    {
        inp[j] = b[i];
        j++;
    }

    const EVP_MD *md_algo = EVP_sha256();
    unsigned int md_len = EVP_MD_size(md_algo);
    std::vector<unsigned char> out( md_len );
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
   std::vector<std::vector<unsigned char> > out;
   int len = inp_list.size();
   if (len == 1)
   {
        out.push_back(inp_list[0]);
        return out;
   }
   for (int i=0; i<len-1; i+=2)
   {
        out.push_back(
            double_sha256(inp_list[i], inp_list[i+1])
        );
   }
   if (len % 2 == 1)
   {
        out.push_back(
            double_sha256(inp_list[len-1], inp_list[len-1])
        );
   }
   return calculate_merkle_root(out);
}



int main()
{
    std::ifstream infile("txids.txt");

    std::vector<std::vector<unsigned char> > txids;
    std::string line;
    int count = 0;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        std::vector<unsigned char> buf2;
        for (int i=31; i>=0; i--)
        {
            buf2.push_back(
                buf[i]
            );
        }
        txids.push_back(
            buf2
        );
        count++;
    }
    infile.close();
    std::cout << count << std::endl;

    std::vector<std::vector<unsigned char> > merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    std::vector<unsigned char> out0 = merkle_root_hash[0];
    std::vector<unsigned char> out;
    for (int i=31; i>=0; i--)
    {
        out.push_back(
            out0[i]
        );
    }

    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');

    return 0;
}

However, the performance is worse compared to the Python implementation (~4s):

$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out 
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e

real      0m9.245s
user      0m9.235s
sys       0m0.008s

The complete implementation and the input file are available here: test.cpp and txids.txt .

How can I improve the performance? Are the compiler optimizations enabled by default? Are there faster sha256 libraries than openssl available?

Answer 1

I decided to implement Merkle Root and SHA-256 computation from scratch, full SHA-256 implemented, using SIMD (Single Instruction Multiple Data) approach, known for SSE2 , AVX2 , AVX512 .

My code below for AVX2 case has speed 3.5x times faster than OpenSSL version, and 7.3x times faster than Python's hashlib implementation.

Here I provide C++ implementation, also I made Python implementation with same speed (because it uses C++ code in the core), for Python implementation see related post . Python implementation is definitely easier to use than C++.

My code is quite complex, both because it has full SHA-256 implementation and also because it has a class for abstracting any SIMD operations, also many tests.

First I provide timings, made on Google Colab because they have quite advanced AVX2 processor there:

MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms

Ossl is for testing OpenSSL implementation, the rest is mine implementation. AVX512 has even more improvement in speed, here it is not tested because Colab has no AVX512 support. Actual improvement in speed depends on processor capabilities.

Compilation is tested both in Windows (MSVC) and Linux (CLang), using following commands:

Windows with OpenSSL support cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib , provide your directory with installed OpenSSL. If OpenSSL support is not needed use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 . Here also instead of of AVX2 you may use SSE2 or AVX512 . Windows openssl can be downloaded from here .
Linux CLang compilation is done through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed and if not needed then clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe . As you can see most recent clang-12 is used, to install it do bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here ). Linux version automatically detects current CPU architecture and uses best SIMD set of instructions.

My code needs C++20 standard support, as it uses some advanced features for easier implementing all things.

I implemented OpenSSL support in my library only to compare timings to show that my AVX2 version is 3-3.5x times faster.

Also providing timings done on GodBolt, but those are only for example of AVX-512 usage, as GodBolt CPUs have advanced AVX-512. Don't use GodBolt to actually measure timings, because all the timings there jump up to 5x times up and down, seems because of active processes eviction by operating system. Also providing GodBolt link for playground (this link may have a bit outdated code, use newest link to code at the bottom of my post):

MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms

Examples of my code usage can be seen inside Test() function that tests all functionality of my library. My code is a bit dirty because I didn't want to spend very much time creating beautiful library, rather just to make a Proof of Concept that SIMD -based implementation can be considerably faster than OpenSSL version.

If you really want to use my boosted SIMD-based version instead of OpenSSL and if you care for speed very much, and you have questions about how to use it, please ask me in comments or chat.

Also I didn't bother about implementing multi-core/multi-threaded version, I think it is obvious how to do that and you can and should implement it without difficulties.

Providing external link to the code below, because my code is around 51 KB in size which exceeds allowed 30 KB of text for StackOverflow post.

sha256_simd.cpp

Answer 2

There are plenty of things you can do to optimize the code.

Here is the list of the important points:

compiler optimizations need to be enabled (using -O3 in GCC);
std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is 32), one can even define a new Hash type for clarity;
parameters should be passed by reference (C++ pass parameter by copy by default)
the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
OPENSSL_free must be called to release the allocated memory of OPENSSL_hexstr2buf ;
push_back should be avoided when the size is a constant known at compile time;
using std::copy is often faster (and cleaner) than a manual copy;
std::reverse is often faster (and cleaner) than a manual loop;
the size of a hash is supposed to be 32, but one can check that using assertions to be sure it is fine;
count is not needed as it is the size of the txids vector;

Here is the resulting code:

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>

#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>

using Hash = std::array<unsigned char, 32>;

Hash double_sha256(const Hash& a, const Hash& b)
{
    assert(a.size() == 32 && b.size() == 32);

    unsigned char inp[64];
    std::copy(a.begin(), a.end(), inp);
    std::copy(b.begin(), b.end(), inp+32);

    const EVP_MD *md_algo = EVP_sha256();
    assert(EVP_MD_size(md_algo) == 32);

    unsigned int md_len = 32;
    Hash out;
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
   std::vector<Hash> out;
   int len = inp_list.size();
   out.reserve(len/2+2);
   if (len == 1)
   {
        out.push_back(inp_list[0]);
        return out;
   }
   for (int i=0; i<len-1; i+=2)
   {
        out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
   }
   if (len % 2 == 1)
   {
        out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
   }
   return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");

    std::vector<Hash> txids;
    std::string line;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        Hash buf2;
        std::copy(buf, buf+32, buf2.begin());
        std::reverse(buf2.begin(), buf2.end());
        txids.push_back(buf2);
        OPENSSL_free(buf);
    }
    infile.close();
    std::cout << txids.size() << std::endl;

    std::vector<Hash> merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    Hash out0 = merkle_root_hash[0];
    Hash out = out0;
    std::reverse(out.begin(), out.end());

    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');

    return 0;
}

On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.

This implementation spends >98% of its time in EVP_Digest . As a result, if you want a faster code, you could try to find a faster hashing library although OpenSSL should be already pretty fast. The current code already succeed to compute 1.7 millions hashes per second in sequential on a mainstream CPU. This is quite good. Alternatively, you can also parallelize the program using OpenMP (this is roughly 5 times faster on my 6 core machine).

How to improve the speed of merkle root calculation in C++?

Question

2 answers

solution1
3 2021-05-08 17:46:12

solution2
2 ACCPTED 2021-05-03 08:47:48

How to improve the speed of merkle root calculation in C++?

Question

2 answers

solution1 3 2021-05-08 17:46:12

solution2 2 ACCPTED 2021-05-03 08:47:48

solution1
3 2021-05-08 17:46:12

solution2
2 ACCPTED 2021-05-03 08:47:48