
AVX2 integer comparison for less-than-or-equal

Original post 2016-05-25 22:14:45 9 1 c/ integer/ compare/ avx/ avx2

What is the most efficient way to compare two AVX vectors of 4x 64-bit integers for <= ?

From the Intel Intrinsics Guide we have

_mm256_cmpgt_epi64(__m256i a, __m256i b) = a > b
_mm256_cmpeq_epi64(__m256i a, __m256i b) = a == b

for comparisons

and

_mm256_and_si256(__m256i a, __m256i b) = a & b
_mm256_andnot_si256(__m256i a, __m256i b) = ~a & b
_mm256_or_si256(__m256i a, __m256i b) = a | b
_mm256_xor_si256(__m256i a, __m256i b) = a ^ b

for logical operations.

My approach was:

// check = ( a <= b ) = ~(a > b) & 0xF..F
__m256i a = ...;
__m256i b = ...;
__m256i tmp = _mm256_cmpgt_epi64(a, b);                            // a > b
__m256i check = _mm256_andnot_si256(tmp, _mm256_set1_epi64x(-1));  // ~tmp & all-ones

1 answer

You're right that there's no direct way to get the mask you really want, only an inverted mask: A gt B = A nle B.

There's no vector-NOT instruction, so you do need a vector of all-ones as well as an extra instruction to invert a vector. (Or a vector of all-zero and _mm256_cmpeq_epi8, but that can't run on as many execution ports as _mm256_xor_si256 with an all-ones vector.) See the x86 tag wiki for performance info, especially Agner Fog's guide.

The other bitwise boolean option, _mm256_andnot_si256, is just as good as xor. It isn't commutative, so it is slightly more complicated to mentally verify that you got it right. xor-with-all-ones is a good idiom for flip-all-the-bits.
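As a rough sketch of the two inversion idioms, the helpers below compute the <= mask from the > mask, once with xor and once with andnot. The function names, the AVX2_FN macro and the array-based interface are made up for illustration; the target attribute assumes GCC or Clang (otherwise build with -mavx2 or your compiler's equivalent).

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for the annotated functions only. */
#define AVX2_FN __attribute__((target("avx2")))

/* out[i] = (a[i] <= b[i]) ? -1 : 0, flipping the a > b mask with XOR. */
AVX2_FN void cmple_epi64_xor(const int64_t a[4], const int64_t b[4],
                             int64_t out[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i gt = _mm256_cmpgt_epi64(va, vb);              /* signed a > b */
    __m256i le = _mm256_xor_si256(gt, _mm256_set1_epi64x(-1));
    _mm256_storeu_si256((__m256i *)out, le);
}

/* Same result with ANDNOT: (~gt) & all-ones. The mask to be inverted
   must be the FIRST operand. */
AVX2_FN void cmple_epi64_andnot(const int64_t a[4], const int64_t b[4],
                                int64_t out[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i gt = _mm256_cmpgt_epi64(va, vb);
    __m256i le = _mm256_andnot_si256(gt, _mm256_set1_epi64x(-1));
    _mm256_storeu_si256((__m256i *)out, le);
}
```

Both versions cost one extra bitwise instruction plus the all-ones constant, so the choice between them is mostly about readability.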

Instead of spending an instruction inverting the mask, in most code it's possible to just use it the opposite way.

e.g. if it's an input to a blendv, then reverse the order of the operands to the blend. Instead of
_mm256_blendv_epi8(a, b, A_le_B_mask), use
_mm256_blendv_epi8(b, a, A_nle_B_mask)
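A minimal sketch of the swapped-operand blend, assuming you want x where a <= b and y elsewhere. select_le_epi64 and the AVX2_FN macro are hypothetical names, and the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for the annotated function only. */
#define AVX2_FN __attribute__((target("avx2")))

/* out[i] = (a[i] <= b[i]) ? x[i] : y[i], without ever building a <= mask. */
AVX2_FN void select_le_epi64(const int64_t a[4], const int64_t b[4],
                             const int64_t x[4], const int64_t y[4],
                             int64_t out[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vx = _mm256_loadu_si256((const __m256i *)x);
    __m256i vy = _mm256_loadu_si256((const __m256i *)y);

    __m256i a_nle_b = _mm256_cmpgt_epi64(va, vb);   /* a > b, i.e. !(a <= b) */

    /* blendv takes its SECOND operand where the mask is set, so y goes
       second: y where a > b, x where a <= b. */
    __m256i r = _mm256_blendv_epi8(vx, vy, a_nle_b);
    _mm256_storeu_si256((__m256i *)out, r);
}
```

The byte-granularity blend is fine here because every byte of a 64-bit compare result is either 0x00 or 0xFF.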

If you were going to AND something with the mask (_mm256_and_si256), use _mm256_andnot_si256 instead, with the inverted mask as the first operand.
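For instance, zeroing the lanes where the condition fails could look roughly like this. keep_if_le_epi64 and AVX2_FN are illustrative names, not part of any library; the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for the annotated function only. */
#define AVX2_FN __attribute__((target("avx2")))

/* out[i] = (a[i] <= b[i]) ? v[i] : 0 */
AVX2_FN void keep_if_le_epi64(const int64_t a[4], const int64_t b[4],
                              const int64_t v[4], int64_t out[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vv = _mm256_loadu_si256((const __m256i *)v);

    __m256i gt = _mm256_cmpgt_epi64(va, vb);     /* inverted mask: a > b */

    /* andnot(gt, v) = ~gt & v: the inversion is folded into the AND,
       so no all-ones constant and no separate NOT step are needed. */
    __m256i r = _mm256_andnot_si256(gt, vv);
    _mm256_storeu_si256((__m256i *)out, r);
}
```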

If you were going to _mm_movemask and test for all-zero, you can instead test for all-ones. It will compile to a cmp eax, -1 instruction instead of a test eax,eax, which is just as efficient. If you were going to bitscan for the first 1, you will have to invert it. An integer not instruction (from using ~ on the movemask result) is cheaper than doing it on the vector.
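Both movemask cases might be sketched as follows. The helper names are hypothetical, the second helper uses _mm256_movemask_pd to get one bit per 64-bit lane (a choice made here, not something the answer prescribes), and __builtin_ctz plus the target attribute assume GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for the annotated functions only. */
#define AVX2_FN __attribute__((target("avx2")))

/* Nonzero if NO lane satisfies a[i] <= b[i]. Rather than testing a <= mask
   for all-zero, test the a > b mask for all-ones (32 byte bits == -1). */
AVX2_FN int none_le_epi64(const int64_t a[4], const int64_t b[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i gt = _mm256_cmpgt_epi64(va, vb);
    return _mm256_movemask_epi8(gt) == -1;
}

/* Index of the first lane with a[i] <= b[i], or -1 if there is none.
   The inversion is done on the scalar bitmask, not on the vector. */
AVX2_FN int first_le_epi64(const int64_t a[4], const int64_t b[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i gt = _mm256_cmpgt_epi64(va, vb);

    /* One bit per 64-bit lane: bit i is set when a[i] > b[i]. */
    unsigned gt_bits = (unsigned)_mm256_movemask_pd(_mm256_castsi256_pd(gt));
    unsigned le_bits = ~gt_bits & 0xFu;       /* scalar NOT, keep 4 lanes */

    return le_bits ? __builtin_ctz(le_bits) : -1;
}
```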

You only have a problem if you were going to OR or XOR, because those instructions don't come in flavours that negate one of their inputs. (IDK if Intel just didn't want to add a PORN mnemonic, but probably PAND and PANDN get more use, especially before variable-blend instructions existed.)