Analog of _mm256_cmp_epi32_mask for AVX2

Question

I have eight 32-bit integers packed into each __m256i register. Now I need to compare the corresponding 32-bit values in two registers. I tried

__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);

which flags the equal pairs. That would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX-512 (this intrinsic needs AVX512F and AVX512VL).

I'm looking for an analogous intrinsic to quickly get the indices of the equal pairs.

I found a workaround (there is no _mm256_movemask_epi32); is the cast legal here?

__m256i diff = _mm256_cmpeq_epi32(r1, r2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);

Answer 1

Yes. Cast intrinsics just reinterpret the bits in the YMM register, so this is 100% legal. And yes, the asm you want the compiler to emit is vpcmpeqd / vmovmskps.
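For reference, here is a minimal sketch of that sequence wrapped in a function. The helper name cmpeq_epi32_mask_avx2 is made up for illustration, and the target attribute is a GCC/Clang extension that I'm assuming so the function builds without -mavx2 on the command line; with MSVC or with -mavx2 you would drop it.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: AVX2 stand-in for
   _mm256_cmp_epi32_mask(a, b, _MM_CMPINT_EQ).
   Takes pointers rather than __m256i arguments so the caller does not
   need AVX enabled (assumption: GCC or Clang). */
__attribute__((target("avx2")))
static int cmpeq_epi32_mask_avx2(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* vpcmpeqd: each 32-bit lane becomes all-ones if equal, else zero */
    __m256i eq = _mm256_cmpeq_epi32(va, vb);

    /* The cast is free (no instruction); vmovmskps collects the sign bit
       of each 32-bit lane into bits 0..7 of the result. */
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

Bit k of the result is set when element k of the two inputs matches, which is the same layout the __mmask8 version gives. The function still needs a CPU with AVX2 at run time.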

Or, if you can deal with each bit being repeated 4 times, vpmovmskb (_mm256_movemask_epi8) also works. E.g. if you just want to test for any match (i != 0) or all matches (i == 0xffffffff), this avoids using a ps instruction on an integer result, which might cost 1 extra cycle of bypass latency on the critical path.
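A rough sketch of the any-match / all-match tests using the byte mask. The function names are hypothetical, and the target attribute is again a GCC/Clang assumption.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: byte mask of the 32-bit compare.
   Each equal element contributes 4 consecutive set bits. */
__attribute__((target("avx2")))
static int cmpeq_epi32_bytemask(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);   /* vpcmpeqd */
    return _mm256_movemask_epi8(eq);           /* vpmovmskb, integer domain */
}

/* Nonzero if at least one pair of elements is equal. */
static int any_equal_epi32(const int32_t a[8], const int32_t b[8])
{
    return cmpeq_epi32_bytemask(a, b) != 0;
}

/* Nonzero if all eight pairs are equal: all 32 mask bits set,
   i.e. -1 as an int, 0xffffffff as an unsigned. */
static int all_equal_epi32(const int32_t a[8], const int32_t b[8])
{
    return cmpeq_epi32_bytemask(a, b) == -1;
}
```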

But if that would cost you extra instructions, e.g. to divide by 4 after using _tzcnt_u32 to get the element index instead of the byte index of the first 1 bit, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
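To make the trade-off concrete, here is a sketch of finding the index of the first equal element both ways. The function names are hypothetical; the target attribute and the x86intrin.h header are GCC/Clang assumptions, and _tzcnt_u32 is the BMI1 intrinsic.

```c
#include <x86intrin.h>   /* GCC/Clang umbrella header (assumption); pulls in AVX2 and BMI1 */
#include <stdint.h>

/* Element index of the first equal pair, or -1 if none.
   Uses the _ps movemask, so the bit index is already the element index. */
__attribute__((target("avx2,bmi")))
static int first_equal_index_ps(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(_mm256_loadu_si256((const __m256i *)a),
                                    _mm256_loadu_si256((const __m256i *)b));
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
    return mask ? (int)_tzcnt_u32(mask) : -1;
}

/* Same result via the byte mask: tzcnt gives a byte index,
   so one extra shift is needed to turn it into an element index. */
__attribute__((target("avx2,bmi")))
static int first_equal_index_epi8(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(_mm256_loadu_si256((const __m256i *)a),
                                    _mm256_loadu_si256((const __m256i *)b));
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    return mask ? (int)(_tzcnt_u32(mask) >> 2) : -1;
}
```

The second version stays in the integer domain but pays for the shift, which is the extra instruction the paragraph above is talking about.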
