Why does gcc -O3 handle the avx256 compare intrinsic differently than gcc -O0 and clang?

Question

I want to set two integer vectors and compare them with SIMD, and later on use this mask for a blend operation on packed floats. I produced the following code:

#include <immintrin.h>
#include <stdio.h>
#include <string.h>


int main(){
    __m256i is =  _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i js =  _mm256_set1_epi32(1);               // integer bit-patterns
    __m256 mask = _mm256_cmp_ps(is,js, _CMP_EQ_OQ);   // compare as subnormal floats

    float val[8];
    memcpy(val, &mask, sizeof(val));
    printf("%f %f %f %f %f %f %f %f \n", val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7]);
}

which works fine with gcc -mavx main.c as well as clang -mavx main.c and clang -O3 -mavx main.c.

(Editor's note: it'll break with -ffast-math when cmpps treats those denormal inputs as 0.0, so all the compares are true. You want AVX2 _mm256_cmpeq_epi32 to do an integer compare, and _mm256_castsi256_ps on the result. But that's unrelated to the question about gcc -O0 and clang allowing implicit conversion from __m256i to __m256.)

However, when I use gcc -O3 -mavx main.c I get the following error message:

main.c: In function ‘main’:
main.c:9:33: error: incompatible type for argument 1 of ‘_mm256_cmp_ps’
    9 |     __m256 mask = _mm256_cmp_ps(is,js, _CMP_EQ_OQ);
      |                                 ^~
      |                                 |
      |                                 __m256i {aka __vector(4) long long int}
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/immintrin.h:51,
                 from main.c:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/avxintrin.h:404:23: note: expected ‘__m256’ {aka ‘__vector(8) float’} but argument is of type ‘__m256i’ {aka ‘__vector(4) long long int’}
  404 | _mm256_cmp_ps (__m256 __X, __m256 __Y, const int __P)
      |                ~~~~~~~^~~
main.c:9:36: error: incompatible type for argument 2 of ‘_mm256_cmp_ps’
    9 |     __m256 mask = _mm256_cmp_ps(is,js, _CMP_EQ_OQ);
      |                                    ^~
      |                                    |
      |                                    __m256i {aka __vector(4) long long int}
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/immintrin.h:51,
                 from main.c:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/avxintrin.h:404:35: note: expected ‘__m256’ {aka ‘__vector(8) float’} but argument is of type ‘__m256i’ {aka ‘__vector(4) long long int’}
  404 | _mm256_cmp_ps (__m256 __X, __m256 __Y, const int __P)
      |                            ~~~~~~~^~~

I notice two things. First of all, the compiler seems to treat is as __m256i {aka __vector(4) long long int} whereas it contains 8 ints. Secondly, the compiler is correct to complain, because the Intel Intrinsics Guide [1] shows the arguments as __m256. I'm now confused why this code even worked in the first place. And if it is indeed correct because the integers are cast to floats, then I don't understand why it doesn't work with gcc -O3.

I did not want to use _mm256_cmpeq_epi32, which returns an __m256i, and there seems to be no blend_ps instruction that accepts such a mask.

Why do the compilers behave differently, and what is the correct way to do this operation?

Compiler versions

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-pkgversion='Arch Linux 9.3.0-1' --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++,d --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
gcc version 9.3.0 (Arch Linux 9.3.0-1)

$ clang -v
clang version 10.0.0 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /opt/cuda, version 10.1

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Answer 1

First of all, the compiler seems to treat is as __m256i {aka __vector(4) long long int} whereas it contains 8 ints.

The __m128i type and its larger counterparts don't specify the actual size (and number) of the integers stored in them. You can use the same __m128i type to store 16 uint8_t values, 2 uint64_t values, or anything in between. The important part is that it is used to store integers. It is the operations on __m128i and the larger vector types that specify how a vector is interpreted as a pack of integers of a given width. For example, both _mm_add_epi16 and _mm_add_epi32 accept __m128i arguments, but the first one interprets them as vectors of 8 uint16_t values, and the second as vectors of 4 uint32_t values.
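To illustrate the point about lane width, here is a minimal sketch (not from the original answer) that feeds the same bit patterns to _mm_add_epi16 and _mm_add_epi32. The helper names lane0_after_add16, lane0_after_add32 and check_lane_width_matters are made up for this example; only SSE2 is assumed, which is baseline on x86-64.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add a and b as 8 x 16-bit lanes, return 32-bit lane 0. */
static uint32_t lane0_after_add16(uint32_t a, uint32_t b) {
    __m128i r = _mm_add_epi16(_mm_set1_epi32((int)a), _mm_set1_epi32((int)b));
    return (uint32_t)_mm_cvtsi128_si32(r);
}

/* Hypothetical helper: add a and b as 4 x 32-bit lanes, return 32-bit lane 0. */
static uint32_t lane0_after_add32(uint32_t a, uint32_t b) {
    __m128i r = _mm_add_epi32(_mm_set1_epi32((int)a), _mm_set1_epi32((int)b));
    return (uint32_t)_mm_cvtsi128_si32(r);
}

/* Same input bits, different intrinsic, different result. */
static void check_lane_width_matters(void) {
    /* 16-bit lanes: 0xFFFF + 1 wraps to 0 and the carry is discarded. */
    assert(lane0_after_add16(0x0000FFFFu, 1u) == 0x00000000u);
    /* 32-bit lanes: the carry propagates into the upper half. */
    assert(lane0_after_add32(0x0000FFFFu, 1u) == 0x00010000u);
}
```

Calling check_lane_width_matters() from main should pass silently: the type __m128i is identical in both helpers, and only the intrinsic decides where the lane boundaries are.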

Secondly, the compiler is correct to complain, because the Intel Intrinsics Guide [1] shows the arguments as __m256.

I think the compiler is correct to complain. That it compiles the code with -O0 seems to be a compiler bug. In gcc, __m128i and the other vector types are implemented using __attribute__((vector_size)) attributes, and the documentation says one should use the __builtin_convertvector intrinsic to convert between vectors of different types.

The original definition of __m128i and the other vector types in the Intel Software Developer's Manual, Section 3.1.1.10, doesn't say anything explicit about convertibility between vectors of different types, though it does say this:

These SIMD data types are not basic Standard C data types or C++ objects, so they may be used only with the assignment operator, passed as function arguments, and returned from a function call.

Given this, I gather that these vector types are not supposed to be implicitly convertible. You certainly cannot rely on the conversion having any particular behavior, even if it does in fact compile. That is especially true given that integer vectors don't specify the size of their elements. Therefore, you should always use an intrinsic to state the kind of conversion you want, e.g. _mm_cvtepi32_ps / _mm_cvtepi32_pd or _mm_castsi128_ps / _mm_castsi128_pd.
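The two families of intrinsics named above do very different things, which the following sketch tries to show: _mm_cvtepi32_ps converts integer values to floats, while _mm_castsi128_ps only reinterprets the bits. The helper names convert_lane0, reinterpret_lane0 and check_convert_vs_cast are hypothetical and exist only for this illustration; SSE2 is assumed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: value conversion, so the integer 1 becomes 1.0f. */
static float convert_lane0(int32_t x) {
    return _mm_cvtss_f32(_mm_cvtepi32_ps(_mm_set1_epi32(x)));
}

/* Hypothetical helper: bit reinterpretation, the 32 bits are left unchanged. */
static float reinterpret_lane0(int32_t x) {
    return _mm_cvtss_f32(_mm_castsi128_ps(_mm_set1_epi32(x)));
}

static void check_convert_vs_cast(void) {
    float f;
    uint32_t bits;

    assert(convert_lane0(1) == 1.0f);

    /* 0x3f800000 is the IEEE-754 bit pattern of 1.0f. */
    assert(reinterpret_lane0(0x3f800000) == 1.0f);

    /* Reinterpreting the integer 1 yields a tiny subnormal, not 1.0f. */
    f = reinterpret_lane0(1);
    memcpy(&bits, &f, sizeof bits);
    assert(bits == 1u);
    assert(f != 1.0f);
}
```

The code in the question relies on the second behavior (the comments there mention bit patterns and subnormal floats), so the cast intrinsic is the one that spells out that intent.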

I did not want to use _mm256_cmpeq_epi32, which returns an __m256i, and there seems to be no blend_ps instruction that accepts such a mask.

_mm256_cmpeq_epi32 is AVX2, and there is _mm256_blendv_epi8 in AVX2. If you're limited to AVX only, then you have to operate on 128-bit integer vectors.
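For blending packed floats specifically, one possible shape of the AVX2 route is sketched below, following the editor's note in the question: an integer compare with _mm256_cmpeq_epi32, an explicit _mm256_castsi256_ps, then _mm256_blendv_ps. The function names select_where_equal and check_select are invented for this example. The target("avx2") attribute and __builtin_cpu_supports are GCC/Clang extensions used here as an assumption so the file builds without -mavx2; with -mavx2 on the command line they are unnecessary.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = (idx[i] == key) ? b[i] : a[i] for 8 lanes.
   Must only be called on a CPU that supports AVX2. */
__attribute__((target("avx2")))
static void select_where_equal(const int32_t idx[8], int32_t key,
                               const float a[8], const float b[8],
                               float out[8]) {
    __m256i vi = _mm256_loadu_si256((const __m256i *)idx);
    /* Integer compare: each 32-bit lane is all-ones where equal, else zero. */
    __m256i eq = _mm256_cmpeq_epi32(vi, _mm256_set1_epi32(key));
    /* Explicit reinterpretation of the mask bits as packed floats. */
    __m256 mask = _mm256_castsi256_ps(eq);
    /* blendv_ps picks b where the mask lane's sign bit is set, else a. */
    __m256 r = _mm256_blendv_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), mask);
    _mm256_storeu_ps(out, r);
}

/* Mirrors the question's data: indices 1..8 compared against 1. */
static void check_select(void) {
    const int32_t idx[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const float a[8] = {0};
    const float b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    float out[8];
    int i;

    if (!__builtin_cpu_supports("avx2"))
        return;  /* skip on CPUs without AVX2 */

    select_where_equal(idx, 1, a, b, out);
    assert(out[0] == 1.0f);
    for (i = 1; i < 8; i++)
        assert(out[i] == 0.0f);
}
```

The cast intrinsic exists only to satisfy the type system, and the compare is a true integer compare, so it does not depend on how the floating-point unit treats denormal or NaN bit patterns.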

Using _mm256_cmp_ps to operate on integer vectors is incorrect because its behavior is different from integer comparison. In particular, there are special rules if at least one of the input operands matches a NaN bit pattern (e.g. with the _CMP_EQ_OQ predicate, the comparison will always return 0 in the corresponding element of the result).
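The difference can be seen with 128-bit vectors as well. The sketch below compares the same bit patterns once with _mm_cmpeq_ps (an ordered, non-signaling equality like _CMP_EQ_OQ) and once with _mm_cmpeq_epi32. The helper names equal_as_float, equal_as_int and check_float_vs_int_compare are hypothetical; the signed-zero case is an extra illustration beyond the NaN case mentioned above. Only SSE2 is assumed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 1 if the two bit patterns compare equal as floats. */
static int equal_as_float(int32_t a, int32_t b) {
    __m128 va = _mm_castsi128_ps(_mm_set1_epi32(a));
    __m128 vb = _mm_castsi128_ps(_mm_set1_epi32(b));
    return _mm_movemask_ps(_mm_cmpeq_ps(va, vb)) & 1;
}

/* Hypothetical helper: 1 if the two bit patterns compare equal as integers. */
static int equal_as_int(int32_t a, int32_t b) {
    __m128i eq = _mm_cmpeq_epi32(_mm_set1_epi32(a), _mm_set1_epi32(b));
    return _mm_movemask_ps(_mm_castsi128_ps(eq)) & 1;
}

static void check_float_vs_int_compare(void) {
    /* 0x7fc00000 is a quiet NaN: identical bits, yet unequal as floats. */
    assert(equal_as_float(0x7fc00000, 0x7fc00000) == 0);
    assert(equal_as_int(0x7fc00000, 0x7fc00000) == 1);

    /* +0.0f and -0.0f: different bits, yet equal as floats. */
    assert(equal_as_float(0, INT32_MIN) == 1);
    assert(equal_as_int(0, INT32_MIN) == 0);
}
```

Both directions of mismatch show up: a float compare can report identical bits as unequal and different bits as equal, which is why it is not a substitute for an integer compare.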


Answer 1 (3, accepted, 2020-05-18 11:34:07)
