
Analog of _mm256_cmp_epi32_mask for AVX2

I have eight 32-bit integers packed into each of two __m256i registers, and I need to compare the corresponding 32-bit values. I tried

__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);

which flags the equal pairs. That would be great, but I got an "illegal instruction" exception, most likely because my processor doesn't support AVX-512 (this intrinsic requires AVX512F and AVX512VL).

I am looking for an analogous AVX2 intrinsic to quickly get the indexes of the equal pairs.

I found a workaround (there is no _mm256_movemask_epi32); is the cast legal here?

__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);

Yes. Cast intrinsics are just a reinterpretation of the bits in the YMM register, so this is 100% legal, and the asm you want the compiler to emit is vpcmpeqd / vmovmskps.
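As a minimal sketch of the workaround, the compare, cast, and movemask steps can be wrapped in one helper. The name eq_mask_epi32 and the AVX2_FN macro are hypothetical and not part of any library; the target attribute assumes GCC or Clang and lets the function compile without -mavx2 (with other compilers, enable AVX2 through the compiler options instead).

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for this function only. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: bit k of the result is set when a[k] == b[k]. */
AVX2_FN static int eq_mask_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Each 32-bit lane becomes all-ones where equal, all-zeros otherwise. */
    __m256i eq = _mm256_cmpeq_epi32(va, vb);

    /* Reinterpret as 8 floats and collect the 8 sign bits into an int. */
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

The result has one bit per 32-bit element, which is the same layout that the __mmask8 from _mm256_cmp_epi32_mask would have.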

Or, if you can deal with each bit being repeated 4 times, vpmovmskb (_mm256_movemask_epi8) also works. For example, if you just want to test for any match (i != 0) or all matches (i == 0xffffffff), you can avoid using a ps instruction on an integer result, which might cost 1 extra cycle of bypass latency on the critical path.
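A rough sketch of the any-match / all-match tests using the byte mask might look like the following. The function names and the AVX2_FN macro are hypothetical, and the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for these functions only. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: 32-bit byte mask, 4 identical bits per element. */
AVX2_FN static uint32_t eq_bytemask_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    return (uint32_t)_mm256_movemask_epi8(eq); /* stays in the integer domain */
}

/* Nonzero if at least one pair of elements is equal. */
AVX2_FN static int any_equal_epi32(const int32_t a[8], const int32_t b[8])
{
    return eq_bytemask_epi32(a, b) != 0;
}

/* Nonzero if all eight pairs of elements are equal. */
AVX2_FN static int all_equal_epi32(const int32_t a[8], const int32_t b[8])
{
    return eq_bytemask_epi32(a, b) == 0xffffffffu;
}
```

Because these tests only ask whether the mask is zero or all-ones, the 4x bit repetition does not matter.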

But if that would cost you extra instructions, e.g. dividing by 4 (a right shift by 2) after using _tzcnt_u32 to turn the byte index of the first set bit into an element index, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
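To illustrate the trade-off, here is a sketch of finding the index of the first equal pair both ways. The function names and AVX2_FN macro are hypothetical, and __builtin_ctz is a GCC/Clang builtin used here as a stand-in for a tzcnt intrinsic so that no BMI1 target flag is needed; the zero-mask case is handled separately because __builtin_ctz(0) is undefined.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for these functions only. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Element index of the first equal pair, or -1 if none (ps movemask). */
AVX2_FN static int first_equal_via_ps(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(_mm256_loadu_si256((const __m256i *)a),
                                    _mm256_loadu_si256((const __m256i *)b));
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
    return mask ? __builtin_ctz(mask) : -1; /* bit index == element index */
}

/* Same result via the byte mask: needs an extra shift to undo the 4x scale. */
AVX2_FN static int first_equal_via_epi8(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(_mm256_loadu_si256((const __m256i *)a),
                                    _mm256_loadu_si256((const __m256i *)b));
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    return mask ? (__builtin_ctz(mask) >> 2) : -1; /* byte index / 4 */
}
```

Both functions return the same index; the second one simply spends one more instruction on the shift, which is the cost described above.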
