
Sparse matrix-dense vector multiplication with matrix known at compile time

I have a sparse matrix with only zeros and ones as entries (for example, with shape 32k x 64k, 0.01% non-zero entries, and no pattern to exploit in where the non-zeros are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing 50% ones and zeros. I want this to be efficient; in particular, I'm trying to make use of the fact that the matrix is known at compile time.

Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few MB of memory, so directly embedding the matrix into the executable seems like a good idea to me. My first idea was to automatically generate C++ code that assigns each result vector entry to the sum of the correct input entries. This looks like this:

constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;

template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
    out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352] ) % 2;
    out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353] ) % 2;
    out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354] ) % 2;
    out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355] ) % 2;
    // LOTS more of this...
    out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858] ) % 2;
}

This does in fact work (though it takes ages to compile). However, it is actually very slow (more than 10x slower than the same sparse matrix-vector multiplication in Julia) and blows up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t and int, with no progress worth mentioning. I also tried replacing the modulo and addition with XOR. In conclusion, this is a terrible idea. I'm not sure why though - is the sheer code size slowing it down that much? Does this kind of code rule out compiler optimization?

I haven't tried any alternatives yet. The next idea I have is storing the indices as compile-time constant arrays (still giving me huge .cpp files) and looping over them. Initially, I expected this would lead the compiler to generate the same binary as from my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on Monday)?

Another idea would be to store the input (and maybe also output?) vector as packed bits and perform the calculation like that. I would expect one can't get around a lot of bit-shifting and AND-operations, and that this would end up being slower and worse overall.

Do you have any other ideas on how this might be done?

I'm not sure why though - is the sheer code size slowing it down that much?

The problem is that the executable is big, so the OS will fetch a lot of pages from your storage device. This process is very slow. The processor will often stall waiting for data to be loaded. And even if the code were already loaded in RAM (OS caching), it would still be inefficient, because the speed of the RAM (latency + throughput) is quite bad. The main issue here is that all the instructions are executed only once. If you reuse the function, the code needs to be reloaded from the cache, and if it is too big to fit in the cache, it will be loaded from the slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need to use quite small code with loops iterating over a fairly small amount of data.

Does this kind of code rule out compiler optimization?

This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code the same way (hence the slow compilation time).

Do you think this is worth trying (I guess I will try anyway on Monday)?

Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending position of the indices for each row can be specified in another buffer referencing the first one (or using pointers). This buffer can be loaded into memory once at the beginning of your application from a dedicated file, to reduce the size of the resulting program (and avoid fetches from the storage device in a critical loop). With a probability of 0.01% of having non-zero values, the resulting data structure will take less than 500 KiB of RAM. On an average mainstream desktop processor, it can fit in the L3 cache (which is rather fast), and I think your computation should not take more than 1 ms, assuming the code of multiply is carefully optimized.
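A minimal sketch of the layout described above (names are illustrative; in practice the buffers would be filled once at startup from a dedicated file): one flat uint16_t buffer holding the column indices of all the ones, row by row, plus an offset buffer marking where each row starts. The multiply loop is then tiny and cache-friendly, since addition mod 2 is just XOR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CSR-style layout: all column indices in one flat buffer,
// with per-row start offsets into it.
struct SparseBinaryMatrix {
    std::vector<std::uint16_t> col_idx;   // indices of the ones, row by row
    std::vector<std::uint32_t> row_start; // row r spans [row_start[r], row_start[r+1])
};

void multiply(const SparseBinaryMatrix& m,
              const std::vector<std::uint8_t>& in,
              std::vector<std::uint8_t>& out) {
    for (std::size_t r = 0; r + 1 < m.row_start.size(); ++r) {
        std::uint8_t acc = 0;
        for (std::uint32_t j = m.row_start[r]; j < m.row_start[r + 1]; ++j)
            acc ^= in[m.col_idx[j]];      // addition mod 2 is XOR
        out[r] = acc;
    }
}
```

uint16_t works for the column indices here because both matrix dimensions are below 65'536.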

Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.

Bit-packing is good only if your matrix is not too sparse. With a matrix filled with 50% non-zero values, the bit-packing method is great. With 0.01% non-zero values, the bit-packing method is clearly bad, as it takes too much space.

I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.

As previously said, loading data from the storage device or the RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much, much faster than loading data).

Here are the approximate timings for various operations that a computer can do:

[Image: table of approximate latencies for various computer operations]

I implemented the second method (constexpr arrays storing the matrix in compressed sparse column format) and it is a lot better. For a 64'000 x 22'000 binary matrix containing 35'000 ones, it takes <1 min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 MB.

Probably one can still do a lot better. If anyone has an idea, let me know!

Below is a code example (showing a 5x10 matrix) illustrating what I did.

#include <iostream>
#include <array>
#include <cstdint>

// Compressed sparse column storage for binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;
constexpr std::array<std::uint16_t, N + 1> colptr = {
0x0,0x1,0x2,0x3,0x4,0x5,0x5,0x5,0x5,0x5,0x5
};
constexpr std::array<std::uint16_t, num_nz> row_idx = {
0x0,0x1,0x2,0x3,0x4
};

template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {
    for (std::size_t col = 0; col < N; col++) {
        // XOR-accumulate in[col] into every row that has a one in this column
        for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
            out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
        }
    }
}

int main() {
    using Bit = bool;
    std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    std::array<Bit, M> output{};
    
    for (auto i : input) std::cout << i;
    std::cout << std::endl;

    encode(input, output);

    for (auto i : output) std::cout << i;
}
