Since casts like this:
__m256d a;
uint64_t t[4];
_mm256_store_si256( (__m256i*)t, (__m256i)a );/* Cast of 'a' to __m256i not allowed */
are not allowed when compiling under Visual Studio, I thought I could use an intrinsic function to convert a __m256d value into a __m256i before passing it to _mm256_store_si256, thus avoiding the cast that causes the error.
But after looking through that list, I couldn't find a function that takes a __m256d argument and returns a __m256i value. So maybe you could help me write my own function or find the one I'm looking for: a function that stores the bits of four 64-bit double values to an array of four 64-bit integers.
EDIT:
After further research, I found _mm256_cvtpd_epi64, which seems to be exactly what I want. But my CPU doesn't support the AVX512 instruction set...
What is left for me to do here?
You could use _mm256_store_pd( (double*)t, a). I'm pretty sure this is strict-aliasing safe because you're not directly dereferencing the pointer after casting it. The _mm256_store_pd intrinsic wraps the store with any necessary may-alias stuff.
(With AVX512, Intel switched to using void* for the load/store intrinsics instead of float*, double*, or __m512i*, to remove the need for these clunky casts and make it more clear that intrinsics can alias anything.)
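As a minimal sketch of the _mm256_store_pd approach, assuming GCC or Clang on x86-64: the helper name double_bits_store_pd is hypothetical, and the target attribute is only there so the function builds without -mavx (with MSVC you would drop it and compile with /arch:AVX). The function takes plain arrays rather than a __m256d so that callers do not need AVX enabled themselves.

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Assumption: GCC/Clang. The attribute enables AVX for this one function. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: copy the raw IEEE-754 bit patterns of four doubles
   into four 64-bit integers. No numeric conversion takes place. */
TARGET_AVX
void double_bits_store_pd(const double in[4], uint64_t out[4])
{
    __m256d a = _mm256_loadu_pd(in);   /* unaligned load: no alignment assumed for `in` */

    alignas(32) uint64_t t[4];         /* _mm256_store_pd needs a 32-byte aligned target */
    _mm256_store_pd((double *)t, a);   /* only the pointer is cast, never dereferenced here */

    for (int i = 0; i < 4; i++)
        out[i] = t[i];                 /* t is declared as uint64_t, so reading it is fine */
}
```

Calling it with {1.0, 2.0, -0.0, 0.5} should leave 0x3FF0000000000000, 0x4000000000000000, 0x8000000000000000 and 0x3FE0000000000000 in out, i.e. the binary64 encodings rather than the numeric values.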
The other option is to use _mm256_castpd_si256 to reinterpret the bits of your __m256d as a __m256i:
alignas(32) uint64_t t[4];
_mm256_store_si256( (__m256i*)t, _mm256_castpd_si256(a));
If you read from t[] right away, your compiler might optimize away the store/reload and just shuffle or use pextrq rax, xmm0, 1 to extract FP bit patterns directly into integer registers. You could write this manually with intrinsics. Store/reload is not bad, though, especially if you want more than one of the double bit-patterns as scalar integers.
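Writing the extraction by hand might look roughly like the sketch below. It assumes GCC or Clang on x86-64 (_mm_cvtsi128_si64 and _mm_extract_epi64 are 64-bit-only intrinsics), and the name extract_double_bits is made up for illustration. A compiler is free to pick different instructions than the ones named in the comments.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. The attribute enables AVX (and SSE4.1) for this function. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: move each double's bit pattern into an integer
   without an explicit store/reload through a temporary array. */
TARGET_AVX
void extract_double_bits(const double in[4], uint64_t out[4])
{
    __m256d a = _mm256_loadu_pd(in);

    /* Low half: both casts are free, they only change the type. */
    __m128i lo = _mm_castpd_si128(_mm256_castpd256_pd128(a));
    /* High half: needs a real instruction (typically vextractf128). */
    __m128i hi = _mm_castpd_si128(_mm256_extractf128_pd(a, 1));

    out[0] = (uint64_t)_mm_cvtsi128_si64(lo);     /* typically vmovq  */
    out[1] = (uint64_t)_mm_extract_epi64(lo, 1);  /* typically vpextrq */
    out[2] = (uint64_t)_mm_cvtsi128_si64(hi);
    out[3] = (uint64_t)_mm_extract_epi64(hi, 1);
}
```

This is mostly worth it when only one or two elements are needed; for all four, the store/reload version is simpler and usually comparable.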
You could instead type-pun through a union, e.g. union m256_elements { uint64_t u64[4]; __m256d vecd; }; but there's no guarantee that will compile efficiently.
The _mm256_castpd_si256 cast compiles to zero asm instructions, i.e. it's just a type-pun to keep the C compiler happy.
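To make the difference between reinterpreting bits and converting values concrete, here is a scalar analogy in portable C. It is only an illustration of what happens per 64-bit lane; the function names bits_of_double and value_of_double are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret: the same 64 bits, viewed as an integer.
   This is the per-lane effect of a cast intrinsic like _mm256_castpd_si256. */
uint64_t bits_of_double(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);  /* well-defined type-pun; usually compiles to a register move */
    return u;
}

/* Convert: a new bit pattern holding the numeric value, truncated toward zero.
   This is a real conversion, not a type-pun. */
int64_t value_of_double(double d)
{
    return (int64_t)d;
}
```

For 1.0, bits_of_double returns 0x3FF0000000000000 while value_of_double returns 1.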
If you wanted to actually round packed double to the nearest signed or unsigned 64-bit integer and have the result in 2's complement or unsigned binary instead of IEEE754 binary64, you need AVX512DQ (plus AVX512VL for the 256-bit form) _mm256/512_cvtpd_epi64 (vcvtpd2qq) for it to be efficient. SSE2 + x86-64 can do it for scalar, or you can use some packed FP hacks for numbers in the [0..2^52] range: see "How to efficiently perform double/int64 conversions with SSE/AVX?".
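One such hack, shown here in scalar form as a sketch, adds 2^52 so that the rounded integer lands in the low mantissa bits, then XORs away the bit pattern of 2^52. The function name double_to_uint64_magic is hypothetical. The sketch assumes 0 <= x < 2^52 with a rounded result also below 2^52, that double arithmetic is evaluated in binary64 (true for SSE2 on x86-64, not necessarily for x87), and that -ffast-math is off. A packed version would use the same two steps with _mm256_add_pd and _mm256_xor_pd.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a double in [0, 2^52) to uint64_t using only
   an FP add and a bitwise XOR. Rounding follows the current FP rounding mode
   (round-to-nearest-even by default). */
uint64_t double_to_uint64_magic(double x)
{
    const double magic = 4503599627370496.0;  /* 2^52, bits 0x4330000000000000 */
    double shifted = x + magic;               /* integer part now sits in the mantissa */

    uint64_t bits, magic_bits;
    memcpy(&bits, &shifted, sizeof bits);
    memcpy(&magic_bits, &magic, sizeof magic_bits);

    return bits ^ magic_bits;                 /* clear the exponent, keep the integer */
}
```

Outside the stated range the result is meaningless, so real code would need a range check or a different method.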
BTW, storeu doesn't require an aligned destination, but store does. If the destination is a local, you should normally align it instead of using an unaligned store, at least if the store happens in a loop, or if this function can inline into a larger function.
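The two cases could be sketched as follows, again assuming GCC or Clang on x86-64. Both function names are hypothetical. The first uses an aligned local with the aligned store; the second writes through a caller-supplied pointer of unknown alignment, so it uses storeu.

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Assumption: GCC/Clang. The attribute enables AVX for the functions below. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Local destination: we control its alignment, so align it and use store. */
TARGET_AVX
uint64_t xor_of_bit_patterns(const double in[4])
{
    __m256i bits = _mm256_castpd_si256(_mm256_loadu_pd(in));
    alignas(32) uint64_t t[4];
    _mm256_store_si256((__m256i *)t, bits);   /* would fault if t were misaligned */
    return t[0] ^ t[1] ^ t[2] ^ t[3];
}

/* Caller-supplied destination: alignment unknown, so use storeu. */
TARGET_AVX
void store_bit_patterns(const double in[4], uint64_t *dst)
{
    __m256i bits = _mm256_castpd_si256(_mm256_loadu_pd(in));
    _mm256_storeu_si256((__m256i *)dst, bits);
}
```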