How to use SIMD to accelerate XOR of two memory blocks?

Question

I want to XOR two blocks of memory as quickly as possible. How can I use SIMD to accelerate it?

My original code is below:

void region_xor_w64(   unsigned char *r1,         /* Region 1 */
                       unsigned char *r2,         /* Region 2 */
                       int nbytes)       /* Number of bytes in region */
{
    uint64_t *l1;
    uint64_t *l2;
    uint64_t *ltop;
    unsigned char *ctop;

    ctop = r1 + nbytes;
    ltop = (uint64_t *) ctop;
    l1 = (uint64_t *) r1;
    l2 = (uint64_t *) r2;

    while (l1 < ltop) {
        *l2 = ((*l1)  ^ (*l2));
        l1++;
        l2++;
    }
}

I wrote an SSE version myself, but it gave little speedup.

void region_xor_sse(   unsigned char* dst,
                       unsigned char* src,
                       int block_size){
  const __m128i* wrd_ptr = (__m128i*)src;
  const __m128i* wrd_end = (__m128i*)(src+block_size);
  __m128i* dst_ptr = (__m128i*)dst;

  do{
    __m128i xmm1 = _mm_load_si128(wrd_ptr);
    __m128i xmm2 = _mm_load_si128(dst_ptr);

    xmm2 = _mm_xor_si128(xmm1, xmm2);
    _mm_store_si128(dst_ptr, xmm2);
    ++dst_ptr;
    ++wrd_ptr;
  }while(wrd_ptr < wrd_end);
}

Answer 1

The more important question is why you would want to do this manually. Do you have an ancient compiler that you think you can outsmart? The good old days when you had to write SIMD instructions by hand are over. Today, in 99% of cases the compiler will do the job for you, and chances are that it will do a much better job. Also, don't forget that new architectures come out every once in a while with ever more extended instruction sets. So ask yourself: do you want to maintain N copies of your implementation, one for each platform? Do you want to constantly test your implementation to make sure it is worth maintaining? Most likely the answer is no.

The only thing you need to do is write the simplest possible code. The compiler will do the rest. For instance, here is how I would write your function:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; ++i)
        r2[i] = r1[i] ^ r2[i];
}

A bit simpler, isn't it? And guess what: the compiler generates code that performs a 128-bit XOR using MOVDQU and PXOR. The critical path looks like this:

4008a0:       f3 0f 6f 04 06          movdqu xmm0,XMMWORD PTR [rsi+rax*1]
4008a5:       41 83 c0 01             add    r8d,0x1
4008a9:       f3 0f 6f 0c 07          movdqu xmm1,XMMWORD PTR [rdi+rax*1]
4008ae:       66 0f ef c1             pxor   xmm0,xmm1
4008b2:       f3 0f 7f 04 06          movdqu XMMWORD PTR [rsi+rax*1],xmm0
4008b7:       48 83 c0 10             add    rax,0x10
4008bb:       45 39 c1                cmp    r9d,r8d
4008be:       77 e0                   ja     4008a0 <region_xor_w64+0x40>

As @Mysticial has pointed out, the above code uses instructions that support unaligned access. Those can be slower, particularly on older CPUs. If, however, the programmer can correctly assume aligned access, it is possible to let the compiler know about it. For example:

void region_xor_w64(unsigned char * restrict r1,
                    unsigned char * restrict r2,
                    unsigned int len)
{
    unsigned char * restrict p1 = __builtin_assume_aligned(r1, 16);
    unsigned char * restrict p2 = __builtin_assume_aligned(r2, 16);

    unsigned int i;
    for (i = 0; i < len; ++i)
        p2[i] = p1[i] ^ p2[i];
}

The compiler generates the following for the above C code (notice movdqa):

400880:       66 0f 6f 04 06          movdqa xmm0,XMMWORD PTR [rsi+rax*1]
400885:       41 83 c0 01             add    r8d,0x1
400889:       66 0f ef 04 07          pxor   xmm0,XMMWORD PTR [rdi+rax*1]
40088e:       66 0f 7f 04 06          movdqa XMMWORD PTR [rsi+rax*1],xmm0
400893:       48 83 c0 10             add    rax,0x10
400897:       45 39 c1                cmp    r9d,r8d
40089a:       77 e4                   ja     400880 <region_xor_w64+0x20>
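The alignment promise is only safe if the caller really supplies 16-byte aligned buffers. Below is a minimal sketch of how that might be arranged, assuming a C11 library that provides aligned_alloc and a GCC- or Clang-compatible compiler for __builtin_assume_aligned. The names region_xor_aligned, alloc_aligned16 and aligned_xor_demo are made up for illustration and are not part of the original answer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same byte-wise loop as above, carrying the 16-byte alignment promise. */
void region_xor_aligned(unsigned char *restrict r1,
                        unsigned char *restrict r2,
                        size_t len)
{
#if defined(__GNUC__) || defined(__clang__)
    unsigned char *restrict p1 = __builtin_assume_aligned(r1, 16);
    unsigned char *restrict p2 = __builtin_assume_aligned(r2, 16);
#else
    /* Assumption: other compilers simply get no alignment hint. */
    unsigned char *restrict p1 = r1;
    unsigned char *restrict p2 = r2;
#endif
    for (size_t i = 0; i < len; ++i)
        p2[i] = p1[i] ^ p2[i];
}

/* Hypothetical helper: allocate n bytes on a 16-byte boundary.
 * aligned_alloc (C11) expects the size to be a multiple of the
 * alignment, so the request is rounded up. Release with free(). */
unsigned char *alloc_aligned16(size_t n)
{
    size_t rounded = (n + 15u) & ~(size_t)15u;
    if (rounded == 0)
        rounded = 16;
    return aligned_alloc(16, rounded);
}

/* Returns 1 if the aligned XOR produced the expected bytes, 0 otherwise. */
int aligned_xor_demo(void)
{
    size_t len = 1000;
    unsigned char *a = alloc_aligned16(len);
    unsigned char *b = alloc_aligned16(len);
    int ok = (a != NULL && b != NULL);

    if (ok) {
        /* The promise made to the compiler must actually hold here. */
        assert(((uintptr_t)a % 16) == 0 && ((uintptr_t)b % 16) == 0);
        memset(a, 0xAA, len);
        memset(b, 0x0F, len);
        region_xor_aligned(a, b, len);
        for (size_t i = 0; i < len; ++i)
            if (b[i] != (0xAA ^ 0x0F))
                ok = 0;
    }
    free(a);
    free(b);
    return ok;
}
```

Passing a pointer that is not 16-byte aligned to region_xor_aligned would break the promise and is undefined behavior, so the check belongs on the caller's side, as in aligned_xor_demo.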

Tomorrow, when I buy myself a laptop with a Haswell CPU, the compiler will generate code that uses 256-bit instructions instead of 128-bit ones from the same source (once it is told to target that CPU), giving me up to twice the vector throughput. It would do so even if I didn't know that Haswell is capable of it. You, on the other hand, would have to not only know about that feature, but also write another version of your code and spend time testing it.

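By the way, it looks like your implementation also has a bug: when the byte count is not a multiple of the word size, the last iteration runs past the end of both buffers (by up to 7 bytes in the 64-bit version and up to 15 bytes in the SSE version) instead of stopping at the last full word.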
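If you do end up writing the intrinsics by hand, one common way to cope with lengths that are not a multiple of the vector size is a vector main loop followed by a scalar tail. The following is only a sketch under assumptions: the function names are hypothetical, unaligned loads and stores are used so no alignment is required, and the SSE2 path is compiled only when the compiler defines __SSE2__ (otherwise the scalar loop handles everything).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* dst[i] ^= src[i] for i in [0, nbytes); any length, any alignment. */
void region_xor_sse2_tail(unsigned char *dst, const unsigned char *src,
                          size_t nbytes)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: 16 bytes per iteration, only while a full vector fits. */
    for (; i + 16 <= nbytes; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
#endif
    /* Scalar tail: the remaining 0..15 bytes (or all bytes without SSE2). */
    for (; i < nbytes; ++i)
        dst[i] ^= src[i];
}

/* Hypothetical self-check: compare against a plain byte loop for several
 * lengths around the 16-byte boundary. Returns 1 on success, 0 on mismatch. */
int region_xor_sse2_tail_check(void)
{
    static const size_t lengths[] = {0, 1, 15, 16, 17, 31, 64, 100};
    unsigned char src[128], dst[128], ref[128];

    for (size_t k = 0; k < sizeof lengths / sizeof lengths[0]; ++k) {
        assert(lengths[k] <= sizeof src);
        for (size_t i = 0; i < sizeof src; ++i) {
            src[i] = (unsigned char)(i * 7 + 3);
            dst[i] = ref[i] = (unsigned char)(i * 13 + 1);
        }
        for (size_t i = 0; i < lengths[k]; ++i)
            ref[i] ^= src[i];
        region_xor_sse2_tail(dst, src, lengths[k]);
        /* Comparing the whole buffer also catches writes past the end. */
        if (memcmp(dst, ref, sizeof dst) != 0)
            return 0;
    }
    return 1;
}
```

The loop condition i + 16 <= nbytes is what keeps the vector loop from touching bytes beyond the requested length. Whether this beats the compiler's auto-vectorized byte loop is something to measure rather than assume.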

At any rate, I would recommend that you trust your compiler and learn how to verify what it generates (i.e. get familiar with objdump). The next choice would be to change the compiler. Only then start thinking about writing vector instructions manually. Or you're gonna have a bad time!

Hope it helps. Good luck!

Answer 2

As the size of the region is passed by value, why wouldn't the code be:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
{
    while (i--)
        r2[i] = r1[i] ^ r2[i];
}

or even:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
{
    while (i--)
        r2[i] ^= r1[i];
}

If there's a preference for going forwards ('up memory') and for using pointers, then:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
{
    while (i--)
        *r2++ ^= *r1++;
}

Answer 3

Okay, if Intel CPUs prefer going forward and prefer pointer operations over indexes, then:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
{
    while (i--)
        *r2++ ^= *r1++;
}

Mike

Answers

Answer 1: 10, accepted
Answer 2: 0, 2019-03-08 13:13:17
Answer 3: 0, 2019-03-09 15:35:14
