
Sparse matrix-dense vector multiplication with matrix known at compile time

I have a sparse matrix with only zeros and ones as entries (and, for example, with shape 32k x 64k and 0.01% non-zero entries and no patterns to exploit in terms of where the non-zero entries are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing 50% ones and zeros. I want this to be efficient, in particular, I'm trying to make use of the fact that the matrix is known at compile time.

Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few MB of memory, and directly embedding the matrix into the executable seems like a good idea to me. My first idea was to automatically generate C++ code that assigns each result-vector entry to the sum of the corresponding input entries. It looks like this:

constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;

template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
    out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352] ) % 2;
    out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353] ) % 2;
    out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354] ) % 2;
    out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355] ) % 2;
    // LOTS more of this...
    out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858] ) % 2;
}

This does in fact work (though it takes ages to compile). However, it turns out to be very slow (more than 10x slower than the same sparse matrix-vector multiplication in Julia) and also blows up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t, and int, with no progress worth mentioning. I also tried replacing the modulo and addition with XOR. In conclusion, this is a terrible idea. I'm not sure why, though - is the sheer code size slowing it down that much? Does this kind of code rule out compiler optimization?

I haven't tried any alternatives yet. The next idea I have is storing the indices as compile-time constant arrays (still giving me huge .cpp files) and looping over them. Initially, I expected the compiler optimizer to generate the same binary from this as from my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on Monday)?

Another idea would be to store the input (and maybe also output?) vector as packed bits and perform the calculation like that. I would expect one can't get around a lot of bit-shifting or AND operations, and that this would end up being slower and worse overall.
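For reference, packing a bool vector into 64-bit words is straightforward; here is a minimal sketch (with a hypothetical small N for illustration):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 128;  // hypothetical small size for illustration

// Pack a bool array into 64-bit words, one bit per entry.
std::array<std::uint64_t, (N + 63) / 64> pack(const std::array<bool, N>& in) {
    std::array<std::uint64_t, (N + 63) / 64> words{};
    for (std::size_t i = 0; i < N; ++i)
        if (in[i])
            words[i / 64] |= std::uint64_t{1} << (i % 64);
    return words;
}
```

For the 64'000-entry input this would shrink the vector from 64 KB (as bool) to 1000 words of 8 bytes each.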

Do you have any other ideas on how this might be done?

I'm not sure why though - is the sheer codesize slowing it down that much?

The problem is that when the executable is big, the OS has to fetch a lot of pages from your storage device. This process is very slow, and the processor will often stall waiting for data to be loaded. Even if the code were already in RAM (thanks to OS caching), it would still be inefficient, because RAM speed (latency + throughput) is quite bad. The main issue here is that each instruction is executed only once. If you reuse the function, the code needs to be reloaded from the cache, and if it is too big to fit in the cache, it will be loaded from the slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need to use quite small code with loops iterating over a fairly small amount of data.

Does this kind of code rule out compiler optimization?

This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code the same way (hence the slow compilation time).

Do you think this is worth trying (I guess I will try anyway on monday)?

Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending position of the indices for each line can be specified in another buffer referencing the first one (or using pointers). This buffer can be loaded into memory once at the beginning of your application from a dedicated file, to reduce the size of the resulting program (and avoid fetches from the storage device in a critical loop). With a probability of 0.01% of having non-zero values, the resulting data structure will take less than 500 KiB of RAM. On an average mainstream desktop processor, it can fit in the L3 cache (which is quite fast), and I think your computation should not take more than 1 ms, assuming the code of multiply is carefully optimized.
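To make that layout concrete, here is a minimal sketch of the structure for a hypothetical 3x8 binary matrix (the index values are made up for illustration; note the offset buffer uses a wider type, since the total number of non-zeros can exceed 65535):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 3x8 binary matrix in a CSR-like layout:
// col_idx holds the column of each "one" (fits in uint16_t since N < 65536);
// row_ptr[i] .. row_ptr[i+1] delimits row i's entries in col_idx.
constexpr std::size_t M = 3, N = 8;
constexpr std::array<std::uint16_t, 4> col_idx = {1, 5, 2, 6};
constexpr std::array<std::uint32_t, M + 1> row_ptr = {0, 2, 3, 4};

void multiply(const std::array<std::uint8_t, N>& in, std::array<std::uint8_t, M>& out) {
    for (std::size_t row = 0; row < M; ++row) {
        std::uint8_t acc = 0;
        for (std::uint32_t j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            acc ^= in[col_idx[j]];  // addition mod 2 is XOR
        out[row] = acc & 1u;
    }
}
```

The inner loop touches only the index buffer and a handful of input entries, so the working set stays cache-resident.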

Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.

Bit-packing the matrix is good only if it is not too sparse. For a matrix filled with 50% non-zero values, the bit-packing method is great. With 0.01% non-zero values, the bit-packing method is clearly bad, as it takes too much space.

I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.

As previously said, loading data from the storage device or from RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much, much faster than loading data).
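To illustrate, reading one bit of a packed input vector costs just one shift and one mask, both single-cycle operations (the helper names and the 64-bit word choice are my own):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Extract bit i from a vector packed into 64-bit words.
inline unsigned get_bit(const std::vector<std::uint64_t>& words, std::size_t i) {
    return static_cast<unsigned>((words[i / 64] >> (i % 64)) & 1u);
}

// XOR-accumulate a sparse row over the packed vector (indices hypothetical).
inline unsigned row_parity(const std::vector<std::uint64_t>& words,
                           const std::uint16_t* idx, std::size_t count) {
    unsigned acc = 0;
    for (std::size_t j = 0; j < count; ++j)
        acc ^= get_bit(words, idx[j]);
    return acc;
}
```

Compared with the cost of a cache or RAM miss, the shift and mask are essentially free.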

Here are approximate timings for various operations a computer can perform:

[Image: table of approximate latencies for common operations - not reproduced here]

I implemented the second method (constexpr arrays storing the matrix in compressed sparse column format) and it is a lot better. For a 64'000 x 22'000 binary matrix containing 35'000 ones, it takes <1 min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 MB.

Probably one can still do a lot better. If anyone has an idea, let me know!

Below is a code example (showing a 5x10 matrix) illustrating what I did.

#include <iostream>
#include <array>
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint16_t

// Compressed sparse column storage for binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;
constexpr std::array<std::uint16_t, N + 1> colptr = {
    0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5
};
constexpr std::array<std::uint16_t, num_nz> row_idx = {
    0x0, 0x1, 0x2, 0x3, 0x4
};

template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {
    for (std::size_t col = 0; col < N; col++) {
        for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
            // XOR-accumulate: inequality of bools is addition mod 2
            out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
        }
    }
}

int main() {
    using Bit = bool;
    std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    std::array<Bit, M> output{};
    
    for (auto i : input) std::cout << i;
    std::cout << std::endl;

    encode(input, output);

    for (auto i : output) std::cout << i;
}
