How to get clang/gcc to vectorize a looped array comparison?

Question

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    for(int ii = 0; ii < 64; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

Looking at the assembly, neither clang nor gcc seems to apply any optimization apart from loop unrolling (with flags -O3 -mavx512f -msse4.2). I would think it's pretty easy to just load both memory regions into 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize the following (ideally it needs only a single 64-bit compare and no special large registers):

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,8);
    b2=(uint8_t*)__builtin_assume_aligned(b2,8);
    for(int ii = 0; ii < 8; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

So are both compilers just dumb, or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?

Answer 1

"I assume" the following is the most efficient:

memcmp(b1, b2, any_size_you_need);

especially for huge arrays!

(For small arrays, there is not a lot to gain anyway!)

Otherwise, you would need to vectorize manually using Intel intrinsics. (Also mentioned by chtz.) I started to look at that until I thought about memcmp.
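As a minimal sketch of the memcmp approach, assuming the fixed 64-byte size from the question (the name equal64_memcmp is made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the two 64-byte blocks are identical.
   The size is a compile-time constant, which gives the compiler the
   option of expanding the call inline instead of calling the library. */
bool equal64_memcmp(const uint8_t *b1, const uint8_t *b2) {
    return memcmp(b1, b2, 64) == 0;
}
```

Whether this ends up as a library call or as a handful of wide loads and compares depends on the compiler and flags, so it is worth checking the generated assembly.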

Answer 2

The compiler must assume that once the function returns (or exits the loop), it is not allowed to have read any bytes past the current index, for example because one of the pointers might point to somewhere near the boundary of invalid memory. The early return in the original loop is what creates this constraint. You can give the compiler a chance to optimize this by using the (non-lazy) bitwise & / | operators to combine the results of the individual comparisons:

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    bool ret = true;
    for(int ii = 0; ii < 64; ++ii){
        ret &= (b1[ii]==b2[ii]);
    }
    return ret;
}

Godbolt demo: https://godbolt.org/z/3ePh7q5rM

gcc still fails to vectorize this, though. So you may need to write manually vectorized versions of this if the function is performance-critical.
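If you would rather not reach for intrinsics, one portable way to write the wide comparison by hand is to load the bytes as 64-bit words with memcpy and combine the differences without an early exit. This is only a sketch and not part of the answer above: the helper names are hypothetical, and whether the 64-byte loop becomes vector instructions still depends on the compiler and flags.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compare 8 bytes with a single 64-bit comparison.
   memcpy is used for the loads so there is no aliasing or alignment
   undefined behavior; compilers typically turn it into a plain load. */
bool equal8_word(const uint8_t *b1, const uint8_t *b2) {
    uint64_t w1, w2;
    memcpy(&w1, b1, sizeof w1);
    memcpy(&w2, b2, sizeof w2);
    return w1 == w2;
}

/* Hypothetical helper: compare 64 bytes as eight 64-bit words.
   All words are always read (no early exit); XOR yields the differing
   bits of each pair and OR accumulates them into one value. */
bool equal64_words(const uint8_t *b1, const uint8_t *b2) {
    uint64_t diff = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t w1, w2;
        memcpy(&w1, b1 + 8 * i, sizeof w1);
        memcpy(&w2, b2 + 8 * i, sizeof w2);
        diff |= w1 ^ w2;
    }
    return diff == 0;
}
```

Like the bitwise & version, both helpers unconditionally read the full 8 or 64 bytes, so they are only valid when both buffers are known to be at least that large.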

Answer 1: 3 (accepted), 2022-08-17 21:14:59

Answer 2: 0, 2022-08-17 21:07:41
