I have a codebase that contains AVX-512 intrinsics and was built with the Intel compiler. I am trying to build the same code with the GNU compiler. When compiling with gcc and the -mavx512f flag, I get a declaration error for only some AVX-512 intrinsics, such as _mm512_mask_i32logather_pd.
Standalone Implementation
#include <iostream>
#include <immintrin.h>
int main() {
    __m512d set = _mm512_undefined_pd();
    __mmask16 msk = 42440;
    __m512i v_index = _mm512_set_epi32(64,66,70,96,98,100,102,104,106,112,114,116,118,120,124,256);
    int scale = 8;
    int count_size = 495*4;
    float *src_ptr = (float*)malloc(count_size*sizeof(float));
    __m512 out_512 = (__m512)_mm512_mask_i32logather_pd(set, msk, v_index, (float*)src_ptr, _MM_SCALE_8);
    return 0;
}
Compiling this standalone example with gcc gives the following error:
error: ‘_mm512_mask_i32logather_pd’ was not declared in this scope; did you mean ‘_mm512_mask_i32gather_pd’?
The same code compiles and runs fine with icc and the -xCORE-AVX512 flag.
Is this because the GNU compiler doesn't support all AVX-512 intrinsics, even though most of them work fine with the -mavx512f flag?
Relevant information
GCC has intrinsics for all AVX-512 instructions. It doesn't always have every alternate version of every intrinsic, where the versions differ only in their C semantics and not in the underlying instruction they expose.
I think the only difference from the regular _mm512_mask_i32gather_pd intrinsic (which GCC supports) is that logather takes a __m512i vindex instead of a __m256i, but uses only the low half, hence the lo in the name. (I looked at them in the intrinsics guide: same pseudocode, just a difference in C/C++ function signature. And they're listed as intrinsics for the same single instruction.) There doesn't seem to be a higather intrinsic that includes a shuffle; you need to do the extracting yourself.
vgatherdpd gathers 8 double elements to fill a __m512d, using 32-bit indices. The corresponding 8 indices are only a total of 32 bytes wide. That's why the regular, more widely supported intrinsic only takes a __m256i vindex arg.
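To make the semantics concrete, here is a minimal scalar sketch of what a merge-masked vgatherdpd computes, as I understand the pseudocode in the intrinsics guide. The function name gather_pd_model and its parameter layout are made up for illustration; this is a model of the behavior, not how any compiler implements the intrinsic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of a merge-masked vgatherdpd:
// 8 x 64-bit elements selected by 8 x 32-bit signed indices.
// `scale` turns each index into a byte offset (8 for an array of double).
void gather_pd_model(double out[8], const double merge[8], unsigned mask,
                     const std::int32_t idx[8], const void* base, int scale) {
    // The instruction only encodes scale factors 1, 2, 4 and 8.
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char* bytes = static_cast<const unsigned char*>(base);
    for (int i = 0; i < 8; ++i) {
        if ((mask >> i) & 1u) {
            // memcpy keeps the model free of alignment and aliasing assumptions.
            std::memcpy(&out[i],
                        bytes + static_cast<std::int64_t>(idx[i]) * scale,
                        sizeof(double));
        } else {
            out[i] = merge[i];  // masked-off element keeps the merge value
        }
    }
}
```

Only 8 indices are ever read, which is 32 bytes of index data, so a __m256i is enough to hold them.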
Your code strangely bothers to initialize 64 bytes (16 indices), without shuffling the high half down. Also, you're merge-masking into _mm512_undefined_pd(), which seems a weird example. But pretty obviously this isn't intended to be useful, since you're also loading from uninitialized malloc memory. You're casting the result to a __m512, I guess using this instruction to gather pairs of float instead of individual doubles? If so, yeah, it's more efficient to gather fewer elements, but it's a weird way to make a minimal simple example for an intrinsic you're looking for. I wonder if perhaps you were looking for _mm512_mask_i32gather_ps to gather 16x float elements, merging into a __m512 vector. (The non-_mask_ version gathers all 16 elements, and you don't have to supply a merge target; that's often what you want.)
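If a 16x float gather is what you wanted, a rough sketch could look like the following. The helper names gather16_ps and gather16_ps_avx512 are hypothetical. It assumes GCC or Clang on x86-64: the target attribute and __builtin_cpu_supports are compiler extensions, used here so the file builds without -mavx512f and falls back to a scalar loop on CPUs without AVX-512F.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// AVX-512 path. The target attribute lets this one function use AVX-512F
// even when the rest of the file is compiled without -mavx512f.
__attribute__((target("avx512f")))
static void gather16_ps_avx512(float* out, const float* merge, unsigned short mask,
                               const std::int32_t* idx, const float* base) {
    __m512i vindex = _mm512_loadu_si512(idx);   // 16 x 32-bit indices
    __m512 old = _mm512_loadu_ps(merge);        // merge target
    // Scale 4 = sizeof(float). Elements whose mask bit is 0 keep `old`.
    __m512 v = _mm512_mask_i32gather_ps(old, (__mmask16)mask, vindex, base, 4);
    // Unmasked form, no merge target needed:
    //   __m512 v = _mm512_i32gather_ps(vindex, base, 4);
    _mm512_storeu_ps(out, v);
}

// Returns true if the AVX-512 path ran, false if the scalar fallback was used.
bool gather16_ps(float* out, const float* merge, unsigned short mask,
                 const std::int32_t* idx, const float* base) {
    assert(out && merge && idx && base);
    if (__builtin_cpu_supports("avx512f")) {
        gather16_ps_avx512(out, merge, mask, idx, base);
        return true;
    }
    for (int i = 0; i < 16; ++i)
        out[i] = ((mask >> i) & 1) ? base[idx[i]] : merge[i];
    return false;
}
```

Here all 16 indices are actually used, so the full __m512i index vector from the question makes sense.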
If you do have your 8 indices in a wider vector for some reason (eg as a result of computation and you're going to do 2 gathers after shuffling), you can just cast the vector type:
__m512i vindex = ...;   // the part we want is only the low half
__m512d result = ...;   // something to merge into
result = _mm512_mask_i32gather_pd(result, mask, _mm512_castsi512_si256(vindex),
                                  src_ptr, _MM_SCALE_8);
Your cast to (float*) in the arg list to the intrinsic makes no sense: it actually takes a void*, so you can gather 64-bit chunks from anything (and yes, it's strict-aliasing and alignment safe, not following C rules). But the normal type would be double*, since this is a _pd gather.
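Putting the cast and the double* source together, a self-contained sketch might look like this. The names gather_lo8 and gather_lo8_avx512 are made up for illustration. It assumes GCC or Clang on x86-64 (the target attribute and __builtin_cpu_supports are compiler extensions), and it passes the scale as the literal 8 rather than relying on the _MM_SCALE_8 constant being defined by every compiler's headers.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// AVX-512 path, enabled per function so the file builds without -mavx512f.
__attribute__((target("avx512f")))
static void gather_lo8_avx512(double* out, const double* merge, unsigned char mask,
                              const std::int32_t* idx16, const double* base) {
    __m512i vindex = _mm512_loadu_si512(idx16);  // 16 indices, only the low 8 matter
    __m512d result = _mm512_loadu_pd(merge);     // merge target
    result = _mm512_mask_i32gather_pd(result, (__mmask8)mask,
                                      _mm512_castsi512_si256(vindex),  // keep low 256 bits
                                      base, 8);  // scale 8 = sizeof(double)
    _mm512_storeu_pd(out, result);
}

// Returns true if the AVX-512 path ran, false if the scalar fallback was used.
bool gather_lo8(double* out, const double* merge, unsigned char mask,
                const std::int32_t* idx16, const double* base) {
    assert(out && merge && idx16 && base);
    if (__builtin_cpu_supports("avx512f")) {
        gather_lo8_avx512(out, merge, mask, idx16, base);
        return true;
    }
    for (int i = 0; i < 8; ++i)
        out[i] = ((mask >> i) & 1) ? base[idx16[i]] : merge[i];
    return false;
}
```

Note the mask is 8 bits wide for a _pd gather, since there are only 8 elements; the upper bits of a 16-bit mask value like the one in the question are simply not used.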
In your example, it would be simpler to just write __m256i vindex = _mm256_setr_epi32(...); (or set, if you like the highest-element-first order for the argument list).
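As a quick illustration of the argument order, the sketch below builds the same 8-index vector both ways, using the low 8 indices from the question. The function names are hypothetical, and as before it assumes GCC or Clang on x86-64, with a plain-array fallback for CPUs without AVX.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// AVX path: build the same index vector with setr and with set.
__attribute__((target("avx")))
static void make_indices_avx(std::int32_t* setr_out, std::int32_t* set_out) {
    // setr: first argument becomes element 0 (memory order).
    __m256i a = _mm256_setr_epi32(256, 124, 120, 118, 116, 114, 112, 106);
    // set: last argument becomes element 0 (highest element first).
    __m256i b = _mm256_set_epi32(106, 112, 114, 116, 118, 120, 124, 256);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(setr_out), a);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(set_out), b);
}

// Returns true if the AVX path ran, false if the scalar fallback was used.
bool make_indices(std::int32_t setr_out[8], std::int32_t set_out[8]) {
    assert(setr_out && set_out);
    if (__builtin_cpu_supports("avx")) {
        make_indices_avx(setr_out, set_out);
        return true;
    }
    const std::int32_t expected[8] = {256, 124, 120, 118, 116, 114, 112, 106};
    for (int i = 0; i < 8; ++i) {
        setr_out[i] = expected[i];
        set_out[i] = expected[i];
    }
    return false;
}
```

Both arrays end up identical, with 256 in element 0, so either spelling can feed the gather directly.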