
How do I get clang/gcc to vectorize looped array comparisons?

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    for(int ii = 0; ii < 64; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

Looking at the assembly, clang and gcc don't seem to apply any optimization (with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think it's pretty easy to just load both memory regions into 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, with no special large registers):

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,8);
    b2=(uint8_t*)__builtin_assume_aligned(b2,8);
    for(int ii = 0; ii < 8; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

So are both compilers just dumb, or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?

I assume the following is the most efficient:

memcmp(b1, b2, any_size_you_need);

especially for huge arrays!

(For small arrays, there is not a lot to gain anyway!)
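As a minimal sketch of the memcmp suggestion applied to the 64-byte case from the question: the wrapper name equal_memcmp is hypothetical, and the only thing to remember is that memcmp returns 0 when the two ranges are equal. With a constant size, a compiler may expand the call inline into a few wide loads, but that depends on the compiler and flags, so check the generated assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compares two 64-byte blocks with memcmp. */
bool equal_memcmp(const uint8_t *b1, const uint8_t *b2) {
    assert(b1 != NULL && b2 != NULL);
    /* memcmp returns 0 when every byte in the range matches. */
    return memcmp(b1, b2, 64) == 0;
}
```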

Otherwise, you would need to vectorize manually using Intel intrinsics. (Also mentioned by chtz.) I started to look at that until I thought about memcmp.
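For illustration, here is a rough sketch of what a manual intrinsics version could look like. It deliberately uses SSE2 (four 16-byte compares) rather than AVX-512, on the assumption that SSE2 is available on any x86-64 target; an AVX-512 variant would follow the same pattern with wider registers. The function name equal_sse2 is made up for this example, and on non-SSE2 targets the sketch simply falls back to memcmp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: compares two 64-byte blocks, 16 bytes at a time. */
bool equal_sse2(const uint8_t *b1, const uint8_t *b2) {
    assert(b1 != NULL && b2 != NULL);
#if defined(__SSE2__)
    /* Start with all bits set and AND in each per-byte comparison mask. */
    __m128i all_eq = _mm_set1_epi8(-1);
    for (int off = 0; off < 64; off += 16) {
        /* Unaligned loads, so no alignment assumption is needed here. */
        __m128i x = _mm_loadu_si128((const __m128i *)(b1 + off));
        __m128i y = _mm_loadu_si128((const __m128i *)(b2 + off));
        /* cmpeq yields 0xFF in every byte position where x and y match. */
        all_eq = _mm_and_si128(all_eq, _mm_cmpeq_epi8(x, y));
    }
    /* movemask collects the top bit of each byte: 0xFFFF means all equal. */
    return _mm_movemask_epi8(all_eq) == 0xFFFF;
#else
    /* Fallback for targets without SSE2. */
    return memcmp(b1, b2, 64) == 0;
#endif
}
```

Note that this version always reads all 64 bytes of both buffers, which is exactly the behaviour the original early-exit loop does not guarantee.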

Because the loop returns at the first mismatch, the compiler must assume that the function is not allowed to read any bytes past the current index once it returns (or exits the loop); for example, one of the pointers might point to somewhere near the boundary of invalid memory. You can give the compiler a chance to optimize this by using the (non-short-circuiting) bitwise & / | operators to combine the results of the individual comparisons:

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    bool ret = true;
    for(int ii = 0; ii < 64; ++ii){
        ret &= (b1[ii]==b2[ii]);
    }
    return ret;
}

Godbolt demo: https://godbolt.org/z/3ePh7q5rM

gcc still fails to vectorize this, though. So if this function is performance critical, you may need to write a manually vectorized version of it.
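One portable way to hand-write this without intrinsics, sketched here as an assumption rather than something the answer proposes, is to apply the same "no early exit" idea at word granularity: load 8 bytes at a time with memcpy, XOR the words, and OR the differences together. The name equal_words is hypothetical, and whether a given compiler turns this into vector instructions still needs to be checked in the assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: branch-free compare of two 64-byte blocks. */
bool equal_words(const uint8_t *b1, const uint8_t *b2) {
    assert(b1 != NULL && b2 != NULL);
    uint64_t diff = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t x, y;
        /* memcpy avoids aliasing and alignment problems; compilers
           typically lower it to a single 8-byte load. */
        memcpy(&x, b1 + 8 * i, sizeof x);
        memcpy(&y, b2 + 8 * i, sizeof y);
        /* Any differing bit in any word leaves diff nonzero. */
        diff |= x ^ y;
    }
    return diff == 0;
}
```

For the 8-byte variant in the question, the same pattern with a single iteration reduces to one 64-bit compare.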

