Analog of _mm256_cmp_epi32_mask for AVX2
I have eight 32-bit integers packed into each __m256i register, and I need to compare the corresponding 32-bit values in two registers. I tried:
__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);
This flags the equal pairs, which would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX512.
I'm looking for an analogous intrinsic to quickly get the indexes of the equal pairs.
I found a workaround (there is no _mm256_movemask_epi32); is the cast legal here?
__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);
Yes. The cast intrinsics are just a reinterpretation of the bits in the YMM register, so this is 100% legal. And yes, the asm you want the compiler to emit is vpcmpeqd / vmovmskps.
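As a minimal sketch of that compare-and-movemask sequence, the helper below takes two arrays of eight 32-bit integers and returns an 8-bit mask with bit k set when element k matches. The name equal_mask_epi32 is hypothetical (it is not a real intrinsic), and the target attribute is a GCC/Clang extension used here only so the function compiles without -mavx2; running it still assumes a CPU with AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: bit k of the result is set when a[k] == b[k]. */
__attribute__((target("avx2")))
int equal_mask_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Each 32-bit lane becomes all-ones on equality, all-zeros otherwise. */
    __m256i eq = _mm256_cmpeq_epi32(va, vb);

    /* The cast only reinterprets the bits; movemask_ps then collects the
       sign bit of each 32-bit lane into the low 8 bits of an int. */
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

The result has the same bit layout as the __mmask8 from the AVX512 compare in the question, so it could be used the same way for equality.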
Or, if you can deal with each bit being repeated 4 times, vpmovmskb (_mm256_movemask_epi8) also works. E.g. if you just want to test for any match (i != 0) or all matches (i == 0xffffffff), you can avoid using a ps instruction on an integer result, which might cost 1 extra cycle of bypass latency on the critical path.
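The any-match and all-match tests described above might look like the following sketch. The function names are hypothetical, and the target attribute is a GCC/Clang extension assumed here so the code builds without -mavx2. Each matching 32-bit element contributes four set bits to the byte mask, which does not matter when the mask is only compared against zero or all-ones.

```c
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if at least one pair of elements is equal. */
__attribute__((target("avx2")))
bool any_equal_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(
        _mm256_loadu_si256((const __m256i *)a),
        _mm256_loadu_si256((const __m256i *)b));
    /* One bit per byte, so 4 bits per 32-bit element; stays in the integer domain. */
    return _mm256_movemask_epi8(eq) != 0;
}

/* Hypothetical helper: true if all eight pairs of elements are equal. */
__attribute__((target("avx2")))
bool all_equal_epi32(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(
        _mm256_loadu_si256((const __m256i *)a),
        _mm256_loadu_si256((const __m256i *)b));
    /* movemask_epi8 returns int, so compare as unsigned against 32 set bits. */
    return (uint32_t)_mm256_movemask_epi8(eq) == 0xffffffffu;
}
```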
But if that would cost you extra instructions, e.g. to scale by 4 after using _tzcnt_u32 to get the element index instead of the byte index of the first 1, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
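To illustrate the trade-off, here is a sketch of finding the index of the first equal pair both ways. The function names are hypothetical, and the target attribute (a GCC/Clang extension) assumes a CPU with both AVX2 and BMI1. The byte-mask version needs an extra shift to turn a byte index into an element index, which is the extra instruction mentioned above.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: index of the first equal pair via the _ps movemask.
   Returns 32 when nothing matches (tzcnt of 0 is the operand width). */
__attribute__((target("avx2,bmi")))
unsigned first_equal_index_ps(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(
        _mm256_loadu_si256((const __m256i *)a),
        _mm256_loadu_si256((const __m256i *)b));
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
    return _tzcnt_u32(mask);        /* already an element index */
}

/* Hypothetical helper: the same result via the byte movemask.
   Returns 8 when nothing matches (32 >> 2). */
__attribute__((target("avx2,bmi")))
unsigned first_equal_index_epi8(const int32_t a[8], const int32_t b[8])
{
    __m256i eq = _mm256_cmpeq_epi32(
        _mm256_loadu_si256((const __m256i *)a),
        _mm256_loadu_si256((const __m256i *)b));
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    return _tzcnt_u32(mask) >> 2;   /* byte index -> element index: extra shift */
}
```

Note that the two versions report "no match" differently (32 versus 8), so a caller would need to check for whichever sentinel applies.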