Convert __m256d to __m256i

Question

Since casts like this:

__m256d a;
uint64_t t[4];
_mm256_store_si256( (__m256i*)t, (__m256i)a ); /* Cast of 'a' to __m256i not allowed */

are not allowed when compiling under Visual Studio, I thought I could use an intrinsic function to convert a __m256d value into a __m256i before passing it to _mm256_store_si256, thus avoiding the cast that causes the error.

But after looking at that list, I couldn't find a function that takes a __m256d value as its argument and returns a __m256i value. So maybe you could help me write my own function or find the function I'm looking for: a function that stores four 64-bit double values into an array of four 64-bit integers.

EDIT:

After further research, I found _mm256_cvtpd_epi64, which seems to be exactly what I want. But my CPU doesn't support the AVX512 instruction set...

What is left for me to do here?

Answer 1

You could use _mm256_store_pd( (double*)t, a). I'm pretty sure this is strict-aliasing safe because you're not directly dereferencing the pointer after casting it. The _mm256_store_pd intrinsic wraps the store with any necessary may-alias stuff.

(With AVX512, Intel switched to using void* for the load/store intrinsics instead of float*, double*, or __m512i*, to remove the need for these clunky casts and make it clearer that intrinsics can alias anything.)
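As a minimal sketch of the store-through-a-cast idea: the helper name store_double_bits and the TARGET_AVX macro are made up for illustration, and the sketch uses the unaligned _mm256_storeu_pd because it cannot assume anything about the alignment of the destination. The macro only lets GCC/Clang build the function without -mavx; MSVC needs no attribute.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang: enable AVX for this one function. MSVC needs no attribute. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: copy the raw bit patterns of 4 doubles into out[0..3]. */
TARGET_AVX
static void store_double_bits(const double *src, uint64_t *out)
{
    __m256d a = _mm256_loadu_pd(src);

    /* The pointer cast is never dereferenced by our own code;
       the intrinsic performs the (unaligned) 32-byte store. */
    _mm256_storeu_pd((double *)out, a);
}
```

After the call, out[i] holds the IEEE754 bit pattern of src[i], not a rounded integer value.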

The other option is to use _mm256_castpd_si256 to reinterpret the bits of your __m256d as a __m256i:

alignas(32) uint64_t t[4];
_mm256_store_si256( (__m256i*)t,  _mm256_castpd_si256(a));

If you read from t[] right away, your compiler might optimize away the store/reload and just shuffle or pextrq rax, xmm0, 1 to extract FP bit patterns directly into integer registers. You could write this manually with intrinsics. Store/reload is not bad, though, especially if you want more than 1 of the double bit-patterns as scalar integers.
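Writing the extraction by hand might look roughly like the sketch below. The function names and the TARGET_AVX macro are invented for this example, and _mm_extract_epi64 is only available when targeting x86-64, so treat it as one possible approach rather than the only one.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang: enable AVX for these functions. MSVC needs no attribute. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: bit pattern of element 1 (upper half of the low lane). */
TARGET_AVX
static uint64_t double_bits_elem1(const double *src)
{
    __m256d a  = _mm256_loadu_pd(src);
    __m128d lo = _mm256_castpd256_pd128(a);     /* low 128 bits, a pure cast */
    __m128i li = _mm_castpd_si128(lo);          /* reinterpret as integer vector */
    return (uint64_t)_mm_extract_epi64(li, 1);  /* typically a single pextrq */
}

/* Hypothetical helper: bit pattern of element 3 (upper half of the high lane). */
TARGET_AVX
static uint64_t double_bits_elem3(const double *src)
{
    __m256d a  = _mm256_loadu_pd(src);
    __m128d hi = _mm256_extractf128_pd(a, 1);   /* high 128 bits (vextractf128) */
    return (uint64_t)_mm_extract_epi64(_mm_castpd_si128(hi), 1);
}
```

Elements in the high lane need the extra vextractf128 step, which is part of why a plain store/reload is competitive when you want several elements.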

You could instead use union m256_elements { uint64_t u64[4]; __m256d vecd; };, but there's no guarantee that will compile efficiently.
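A rough sketch of the union approach follows. The helper double_bits_via_union and the TARGET_AVX macro are hypothetical names for illustration. Reading a union member other than the one last written is defined in C; in C++ it relies on a compiler extension that the major compilers provide.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang: enable AVX for this one function. MSVC needs no attribute. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

union m256_elements {
    uint64_t u64[4];
    __m256d  vecd;
};

/* Hypothetical helper: bit pattern of element i (0..3) of the vector at src. */
TARGET_AVX
static uint64_t double_bits_via_union(const double *src, int i)
{
    union m256_elements u;
    u.vecd = _mm256_loadu_pd(src);  /* write the vector member */
    return u.u64[i & 3];            /* read it back as integers */
}
```

Whether this becomes a store/reload or a register shuffle is left entirely to the compiler.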

The _mm256_castpd_si256 cast compiles to zero asm instructions, i.e. it's just a type-pun to keep the C compiler happy.

If you wanted to actually round packed double to the nearest signed or unsigned 64-bit integer and have the result in 2's complement or unsigned binary instead of IEEE754 binary64, you need AVX512DQ (plus AVX512VL for the 256-bit version) _mm256/512_cvtpd_epi64 (vcvtpd2qq) for it to be efficient. SSE2 + x86-64 can do it for scalar, or you can use some packed FP hacks for numbers in the [0..2^52] range: How to efficiently perform double/int64 conversions with SSE/AVX?
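For the scalar route without AVX512, something along these lines should work when targeting x86-64. The helper name round_pd_to_i64 is made up for this sketch, and it converts one element at a time rather than as a packed operation.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: round 4 doubles to int64_t one at a time.
   Assumes an x86-64 target, where _mm_cvtsd_si64 is available. */
static void round_pd_to_i64(const double *src, int64_t *out)
{
    for (int i = 0; i < 4; ++i) {
        /* cvtsd2si with a 64-bit destination: rounds using the current
           MXCSR rounding mode (round-to-nearest-even by default). */
        out[i] = (int64_t)_mm_cvtsd_si64(_mm_load_sd(&src[i]));
    }
}
```

Unlike the bit-pattern stores above, this yields real integer values, so 1.4 becomes 1 and -2.6 becomes -3.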

BTW, storeu doesn't require an aligned destination, but store does. If the destination is a local, you should normally align it instead of using an unaligned store, at least if the store happens in a loop, or if this function can inline into a larger function.
