
Analog of _mm256_cmp_epi32_mask for AVX2

I have eight 32-bit integers packed into each of two __m256i registers, and I need to compare the corresponding 32-bit values in the two registers. I tried

__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);

which flags the equal pairs. That would be ideal, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX512.
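One way to confirm that suspicion is a runtime feature check. The sketch below assumes GCC or Clang on x86, since __builtin_cpu_supports is a compiler builtin rather than part of the intrinsics API. The function name has_avx512_for_cmp_mask is made up for illustration. _mm256_cmp_epi32_mask needs both AVX512F and AVX512VL, so both are tested.

```c
/* Assumption: GCC or Clang targeting x86; __builtin_cpu_supports is a
 * compiler builtin, not an Intel intrinsic. */

/* Hypothetical helper: returns 1 if the CPU reports both AVX512F and
 * AVX512VL, which _mm256_cmp_epi32_mask requires, and 0 otherwise. */
static int has_avx512_for_cmp_mask(void)
{
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512vl");
}
```

If this returns 0, the AVX-512 intrinsic cannot run on that machine and an AVX2 fallback is needed.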

I'm looking for an analogous intrinsic to quickly get the indexes of the equal pairs.

I found a workaround (there is no _mm256_movemask_epi32); is the cast legal here?

__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);

Yes. The cast intrinsics just reinterpret the bits in the YMM register and compile to no instructions, so this is 100% legal, and the asm you want the compiler to emit is vpcmpeqd / vmovmaskps.
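As a minimal sketch, the workaround can be wrapped in a function that returns the same 8-bit mask the AVX-512 intrinsic would produce for _MM_CMPINT_EQ. The names equal_mask_epi32 and mask_to_indexes are made up for illustration, and the target attribute is a GCC/Clang extension assumed here so the block compiles without -mavx2; with that flag it can be dropped.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: bit i of the result is set when a[i] == b[i],
 * for i in 0..7. Both pointers must reference 8 readable int32_t values. */
__attribute__((target("avx2")))
static unsigned equal_mask_epi32(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);   /* vpcmpeqd: all-ones per equal lane */
    /* The cast only changes the type; vmovmaskps collects each lane's sign bit. */
    return (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
}

/* Hypothetical helper: writes the positions of the set bits among the low
 * 8 bits of mask into out (room for 8 entries) and returns how many. */
static int mask_to_indexes(unsigned mask, int *out)
{
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (mask & (1u << i))
            out[n++] = i;
    return n;
}
```

For a = {1,2,3,4,5,6,7,8} and b = {1,0,3,0,0,6,0,8}, lanes 0, 2, 5 and 7 match, so the mask is 0xA5.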

Or, if you can deal with each bit being repeated 4 times, vpmovmskb (_mm256_movemask_epi8) also works. E.g. if you just want to test for any match (i != 0) or all matches (i == 0xffffffff), this avoids using a ps instruction on an integer result, which might cost 1 extra cycle of bypass latency on the critical path.
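The any-match and all-match tests could look roughly like this. It is a sketch under the same assumptions as before: GCC/Clang target attribute and made-up function names. Since _mm256_movemask_epi8 returns an int, the all-ones pattern 0xffffffff is compared as -1.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if at least one of the 8 lanes is equal. */
__attribute__((target("avx2")))
static int any_equal_epi32(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    /* vpmovmskb: one bit per byte, so each equal lane contributes 4 set bits. */
    return _mm256_movemask_epi8(eq) != 0;
}

/* Hypothetical helper: nonzero if all 8 lanes are equal. */
__attribute__((target("avx2")))
static int all_equal_epi32(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    /* All 32 byte bits set is 0xffffffff, which is -1 as an int. */
    return _mm256_movemask_epi8(eq) == -1;
}
```

Neither test depends on which bit belongs to which lane, so the 4x repetition costs nothing here.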

But if that would cost you extra instructions, e.g. a right shift by 2 after using _tzcnt_u32 to turn the byte index of the first 1 into an element index, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
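To make the trade-off concrete, here is a sketch of finding the first equal lane both ways. The function names are made up, the target attribute is a GCC/Clang extension, and the trailing-zero count is spelled _tzcnt_u32 in immintrin.h (it needs BMI1). Returning -1 for "no match" is a choice made for this example only.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: index (0..7) of the first equal lane, or -1 if none.
 * The _ps movemask gives one bit per lane, so tzcnt is the lane index. */
__attribute__((target("avx2,bmi")))
static int first_equal_index_ps(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
    return mask ? (int)_tzcnt_u32(mask) : -1;
}

/* Same result via the byte movemask: tzcnt yields a byte index, so one
 * extra shift is needed to convert it to a lane index. */
__attribute__((target("avx2,bmi")))
static int first_equal_index_epi8(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    return mask ? (int)(_tzcnt_u32(mask) >> 2) : -1;   /* byte index / 4 */
}
```

The second version is the case described above: the shift is the extra instruction, which is why the _ps movemask is the better fit when an element index is wanted.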


