AVX2 analogue of _mm256_cmp_epi32_mask

Question

I have eight 32-bit integers packed into each of two __m256i registers. Now I need to compare the corresponding 32-bit values in the two registers. I tried

__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);

which flags the equal pairs. That would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX-512.

I'm looking for an analogous intrinsic to quickly get the indexes of the equal pairs.

I found a workaround (there is no _mm256_movemask_epi32); is the cast legal here?

__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);

Answer 1

Yes. The cast intrinsics are just a reinterpretation of the bits in the YMM register, so this is 100% legal, and the asm you want the compiler to emit is vpcmpeqd / vmovmskps.
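As a minimal sketch of that compare-then-movemask sequence, the helper below returns an 8-bit mask with bit k set when element k is equal. The function name `eq_mask_epi32` and the pointer-based interface are assumptions made for illustration, and the `target` attribute is a GCC/Clang extension used here so the file builds without `-mavx2`; it still needs an AVX2-capable CPU at run time.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): emulates the result of
 * _mm256_cmp_epi32_mask(a, b, _MM_CMPINT_EQ) using only AVX2.
 * Bit k of the return value is set when a[k] == b[k]. */
__attribute__((target("avx2")))
int eq_mask_epi32(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* vpcmpeqd: each 32-bit lane becomes all-ones if equal, else zero. */
    __m256i eq = _mm256_cmpeq_epi32(va, vb);

    /* The cast is free (no instruction); vmovmskps collects the sign bit
     * of each 32-bit lane into the low 8 bits of an int. */
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

For example, if elements 0, 2, 6 and 7 match, the returned mask is 0xC5.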

Or, if you can deal with each bit being repeated 4 times, vpmovmskb (_mm256_movemask_epi8) also works. For example, if you just want to test for any match (i != 0) or all elements matching (i == 0xffffffff), you can avoid using a ps instruction on an integer result, which might cost 1 extra cycle of bypass latency on the critical path.
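A sketch of the any-match / all-match tests with the byte mask might look like this. The function names are hypothetical, and the `target` attribute is a GCC/Clang extension assumed here so no extra compiler flags are needed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: byte mask of the 32-bit equality compare.
 * Each equal element contributes 4 consecutive set bits. */
__attribute__((target("avx2")))
static int eq_bytemask_epi32(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);   /* vpcmpeqd */
    return _mm256_movemask_epi8(eq);           /* vpmovmskb, stays in the integer domain */
}

/* Nonzero if at least one of the 8 element pairs is equal. */
__attribute__((target("avx2")))
int any_equal_epi32(const int32_t *a, const int32_t *b)
{
    return eq_bytemask_epi32(a, b) != 0;
}

/* Nonzero if all 8 element pairs are equal (all 32 mask bits set,
 * which is -1 when the mask is held in a signed int). */
__attribute__((target("avx2")))
int all_equal_epi32(const int32_t *a, const int32_t *b)
{
    return eq_bytemask_epi32(a, b) == -1;
}
```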

But if the byte mask would cost you an extra instruction, e.g. a shift right by 2 after _tzcnt_u32 to turn the byte index of the first set bit into an element index, then use the _ps movemask instead. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
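To illustrate the trade-off, here is a sketch of finding the index of the first equal pair both ways. The function names and the -1 "no match" return value are assumptions for illustration; `_tzcnt_u32` requires BMI1, which is enabled here through the GCC/Clang `target` attribute.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: index (0..7) of the first equal pair, or -1 if none.
 * With the _ps movemask, the bit position is already the element index. */
__attribute__((target("avx2,bmi")))
int first_equal_index_ps(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
    if (mask == 0)
        return -1;
    return (int)_tzcnt_u32(mask);
}

/* Same result via the byte mask: tzcnt yields a byte index (0, 4, 8, ...),
 * so one extra shift is needed to get the element index. */
__attribute__((target("avx2,bmi")))
int first_equal_index_epi8(const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi32(va, vb);
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    if (mask == 0)
        return -1;
    return (int)(_tzcnt_u32(mask) >> 2);   /* the extra instruction */
}
```

If the byte offset is what you actually need (for example, to index into memory in bytes), the byte-mask variant needs no shift at all.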

Solution 1
3 Accepted 2021-01-06 05:14:30