
Intrinsics for binary matrix vector multiplication

I am trying to implement a matrix-vector multiplication over a binary field. The vector x is of dimension 1xa, the matrix M is of dimension axb, and the result y = x * M is of size 1xb. Right now, I have implemented it such that x and M are of type uint8_t*, i.e., I concatenate the columns of M since they are also accessed successively. The function looks like:

void mul(uint8_t *M, size_t a, size_t b, uint8_t *x, uint8_t *y) {
    uint8_t val;
    uint8_t *ptr;
    for(size_t i = 0; i < b; i++) {
        val = 0;
        ptr = M + i * a;
        for(size_t j = 0; j < a; j++) {
            val ^= (x[j] & *ptr++);
        }
        y[i] = val;
    }
}

M, x, and y have been allocated by the caller as

M = malloc(sizeof(uint8_t) * a * b);
x = malloc(sizeof(uint8_t) * a);
y = malloc(sizeof(uint8_t) * b);

As this routine is called billions of times, I need to squeeze every last bit of performance out of it ;) To do so, I was thinking of

  • instead of representing each 0/1 as a separate uint8_t (i.e., 8 bits), packing all bits of "x" and "M" into arrays of uint64_t of much smaller sizes ap and Mp, where

    ap = (size_t) ceil((double) a / 64);
    Mp = (size_t) ceil((double) (a*b) / 64);

  • using vector intrinsics.
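The packing from the first bullet can be sketched as follows (the helper name `pack_bits` and the MSB-first, left-aligned layout are illustrative choices, not fixed by the text above):

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t word;
#define WORD_BITS (CHAR_BIT * sizeof(word))

/* Pack n bytes holding 0/1 values into ceil(n/WORD_BITS) left-aligned
 * words: input bit j lands in bit (WORD_BITS - 1 - j % WORD_BITS) of
 * word j / WORD_BITS; unused trailing bits of the last word stay 0. */
static void pack_bits(const uint8_t *in, size_t n, word *out) {
    size_t nwords = (n + WORD_BITS - 1) / WORD_BITS;
    memset(out, 0, nwords * sizeof(word));
    for (size_t j = 0; j < n; j++)
        if (in[j] & 1)
            out[j / WORD_BITS] |= (word)1 << (WORD_BITS - 1 - j % WORD_BITS);
}
```

Packing M this way, column by column, yields the concatenated layout the original mul() already uses, just 64 entries per word instead of one.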

So far, I have accomplished the (left-aligned) packing of M (with proper alignment) and implemented the multiplication as

typedef uint64_t word;
#define WORD_BITS      (CHAR_BIT * sizeof (word))

void mul_fast(word *M, size_t Mlen, word *x, size_t xlen, size_t b, word *y) {
    /* note: y must be zero-initialized by the caller */
    for(size_t i = 0; i < Mlen; i++) {
        y[i/xlen] ^= (M[i] & x[i % xlen]);
    }
    for(size_t i = 0; i < b; i++) {
        y[i] = __builtin_popcountll(y[i]) & 1;
    }
}

However, it turns out that the above is much slower than the straightforward implementation of mul().

Do you have any ideas or references? I am not an assembler expert, so comparing the output of gcc -S does not tell me much :/

Thank you, best regards, Tom. 谢谢您,汤姆。

The relevant difference in the assembler output is:

.L26:
-       movq    %r10, %rax
-       xorl    %edx, %edx
-       divq    %rcx
-       movq    (%r11,%rdx,8), %rdx
-       andq    (%rdi,%r10,8), %rdx
-       addq    $1, %r10
-       xorq    %rdx, (%r9,%rax,8)
-       cmpq    %r10, %rsi
+       movq    %rax, %rcx
+       movq    %rax, %r10
+       andl    $1, %ecx
+       shrq    %r10
+       movq    (%rdx,%rcx,8), %rcx
+       andq    (%rdi,%rax,8), %rcx
+       addq    $1, %rax
+       xorq    %rcx, (%r9,%r10,8)
+       cmpq    %rax, %rsi

Can you see what the culprit was?
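For what it is worth, the `divq` in the first listing comes from computing `i / xlen` and `i % xlen` with a runtime divisor (in the second listing xlen was apparently a power of two, so gcc reduced the division to `andl`/`shrq`). One way to avoid the division for arbitrary xlen is to restructure the loop as one dot product per output word; a sketch (`mul_fast2` is an illustrative name, assuming the same row layout as mul_fast, i.e. row i of packed M occupies words `M[i*xlen] .. M[i*xlen + xlen - 1]`):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t word;

/* Same computation as mul_fast, restructured so that no i/xlen or
 * i%xlen division is needed and y need not be pre-zeroed. */
void mul_fast2(const word *M, const word *x, size_t xlen, size_t b, word *y) {
    for (size_t i = 0; i < b; i++) {
        word acc = 0;
        const word *row = M + i * xlen;
        for (size_t j = 0; j < xlen; j++)
            acc ^= row[j] & x[j];
        /* the parity of the XOR-accumulated word equals the parity of
         * the total AND-ed bit count: XOR drops shared 1-bits in pairs */
        y[i] = __builtin_popcountll(acc) & 1;
    }
}
```

`__builtin_popcountll` is the same GCC builtin used above; on targets with `-mpopcnt` it compiles to a single instruction.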


 