Checking If A Vector Contains Any Element Greater Than Zero
I would be thankful if somebody could help me write a function that receives an AVX vector and checks whether it contains any element greater than zero.
I have written the following code, but it is not optimal because it stores the elements to memory and then examines them one at a time. The vector should be checked as a whole.
#include <immintrin.h>
#include <stdlib.h>

int check(__m256 vector)
{
    float *temp;
    posix_memalign((void **)&temp, 32, 8 * sizeof(float));
    _mm256_store_ps(temp, vector);
    int flag = 0;
    for (int k = 0; k < 8; k++)
    {
        if (temp[k] > 0)
        {
            flag = 1;
            break;   /* leave the loop so temp is still freed below */
        }
    }
    free(temp);
    return flag;
}
If you're going to branch on the result, it's usually fewer uops to use the "traditional" compare / movemask / integer-test, like you would with SSE1.
__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
int cmp = _mm256_movemask_ps(vcmp);
if (cmp)
return 1;
This typically compiles to something like
vcmplt_oqps ymm2, ymm0, ymm1
vmovmskps eax, ymm2
test eax,eax
jnz .true_branch
Those are all single-uop instructions, and test/jnz macro-fuse on Intel and AMD CPUs that support AVX, so this is only 3 total uops (on Intel).
See Agner Fog's instruction tables + microarch guide, and other guides linked from https://stackoverflow.com/tags/x86/info .
You can also use PTEST, but it's less efficient for this case. See "_mm_testc_ps and _mm_testc_pd vs _mm_testc_si128".
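The compare / movemask / integer-test sequence above can be wrapped into a complete function. The following is a minimal sketch, not a definitive implementation: the names `any_gt_zero`, `any_gt_zero_8f` and the `TARGET_AVX` macro are made up for illustration, and the `target` attribute assumes GCC or Clang (with other compilers, enable AVX for the whole file instead).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute enables AVX for these functions only. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Returns 1 if any of the 8 floats in x is strictly greater than zero.
   _CMP_LT_OQ is an ordered compare, so NaN elements never count as > 0. */
TARGET_AVX static int any_gt_zero(__m256 x)
{
    __m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ); /* 0 < x */
    return _mm256_movemask_ps(vcmp) != 0;  /* one mask bit per element */
}

/* Hypothetical wrapper: tests the 8 floats at p (no alignment required). */
TARGET_AVX int any_gt_zero_8f(const float *p)
{
    return any_gt_zero(_mm256_loadu_ps(p));
}
```

Calling `any_gt_zero_8f` on an array holding only negative values and zeros should give 0, and changing any one element to a positive value should give 1.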
Without AVX, ptest is handy for checking if a register is all-zero without needing extra instructions to copy it (since it sets integer flags directly). But since it's 2 uops, and can't macro-fuse with a jcc branch instruction, it's actually worse than the above:
// don't use, sub-optimal
__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
__m256i vcmpi = _mm256_castps_si256(vcmp);  // testz_si256 takes integer vectors
if (!_mm256_testz_si256(vcmpi, vcmpi)) {
    return 1;
}
The testz intrinsic is PTEST. It sets the ZF and CF flags directly based on the results of AND and AND NOT of its args. The testz intrinsic returns ZF, which is set when the AND result is all-zero, so !testz(vcmp, vcmp) is true when vcmp has any non-zero bits (which it will only when vcmpps puts some there).
VPTEST with ymm regs is available with just AVX. AVX2 isn't required even though it looks like a vector-integer instruction.
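For comparison, here is the vptest variant as a complete function. This is only a sketch for checking that it returns the same answers as the movemask version. The name `any_gt_zero_ptest_8f` and the `TARGET_AVX` macro are hypothetical, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute enables AVX for this function only. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Returns 1 if any of the 8 floats at p is > 0, using vptest on the
   compare result instead of vmovmskps + test. Needs AVX only, not AVX2. */
TARGET_AVX int any_gt_zero_ptest_8f(const float *p)
{
    __m256 x = _mm256_loadu_ps(p);
    __m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
    __m256i bits = _mm256_castps_si256(vcmp);  /* reinterpret, no instruction */
    /* testz returns ZF, which is 1 when (bits & bits) is all-zero,
       so the negation means "at least one compare-mask bit is set". */
    return !_mm256_testz_si256(bits, bits);
}
```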
The vptest version will compile to something like
vcmplt_oqps ymm2, ymm0, ymm1
vptest ymm2, ymm2
jnz .true_branch
Probably smaller code-size than the above, but this is actually 4 uops instead of 3. If you were using setnz or cmovnz, macro-fusion wouldn't be a factor, so ptest would be break-even. As I mentioned above, the main use-case for ptest is when you can use it without a compare instruction, and without AVX.
The alternative for checking a vector for all-zero ( pcmpeqb xmm0, xmm1 with xmm1 zeroed / pmovmskb eax, xmm0 / cmp eax, 0xffff ) has to destroy one of the input vectors without AVX, so it will require an extra movdqa instruction to copy if you still need both after the test.
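The two all-zero checks described above can be sketched with intrinsics as follows. The function names and the `TARGET` macro are made up for illustration, the `target` attribute assumes GCC or Clang, and which instructions the compiler actually emits (including any movdqa copy) is up to the compiler.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Enables the named ISA extension per function. */
#if defined(__GNUC__)
#define TARGET(isa) __attribute__((target(isa)))
#else
#define TARGET(isa)
#endif

/* SSE4.1 ptest: ZF is set directly from (v & v) == 0, v is left intact. */
TARGET("sse4.1") int is_all_zero_ptest(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_testz_si128(v, v);
}

/* SSE2 alternative: compare every byte against zero, collect one mask bit
   per byte, and check that all 16 bytes compared equal. */
TARGET("sse2") int is_all_zero_movemask(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```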
ptest floating point bit-hacks
I think for this specific test, it might be possible to skip the compare instruction and use vptest directly to see if there are any float elements with their sign bit unset, but some non-zero bits elsewhere.
Actually no, that idea can't work, because it doesn't respect element boundaries. It couldn't tell the difference between a vector with a positive element vs. a vector with a +0.0 element (sign bit clear) and another element that was negative (other bits set).
vptest sets CF = ((~src1 & src2) == 0) and ZF = ((src1 & src2) == 0). I was thinking that src1 = set1(0x7FFFFFFF) could tell us something useful about sign bits and non-sign bits, which we could test with a condition that checks CF and ZF. For example ja: CF=0 and ZF=0. There actually isn't an x86 condition that's only true with CF=1 and ZF=0, though, so that's another problem.
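The element-boundary problem can be seen by running vptest with the 0x7FFFFFFF mask on two vectors that should give different answers. This is a small sketch: `ptest_flags_8f` and `same_ptest_flags` are hypothetical helper names, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute enables AVX for this function only. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Runs vptest with src1 = set1(0x7FFFFFFF) and src2 = the 8 floats at p,
   and reports the two flags it would set. */
TARGET_AVX void ptest_flags_8f(const float *p, int *zf, int *cf)
{
    __m256i mask = _mm256_set1_epi32(0x7FFFFFFF);
    __m256i bits = _mm256_castps_si256(_mm256_loadu_ps(p));
    *zf = _mm256_testz_si256(mask, bits);  /* 1: no non-sign bit set anywhere */
    *cf = _mm256_testc_si256(mask, bits);  /* 1: no sign bit set anywhere */
}

/* Returns 1 if both inputs produce the same (ZF, CF) pair. */
int same_ptest_flags(const float *a, const float *b)
{
    int zf_a, cf_a, zf_b, cf_b;
    ptest_flags_8f(a, &zf_a, &cf_a);
    ptest_flags_8f(b, &zf_b, &cf_b);
    return zf_a == zf_b && cf_a == cf_b;
}
```

With `{+0.0, -1.0, 0, ...}` (no positive element) and `{+1.0, -1.0, 0, ...}` (one positive element), both calls yield ZF=0 and CF=0, so the flags alone cannot distinguish the two cases.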
Also NaN > 0 is false, but NaN has some set bits (exponent all-ones, mantissa non-zero, sign-bit = don't care so there can be +NaN and -NaN). If that was the only problem, this would still be useful in cases where NaN-handling isn't required.
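The NaN case is easy to check in scalar code. The sketch below assumes a 32-bit IEEE-754 `float`, and the helper names are hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Returns the raw bit pattern of f (assumes float is 32-bit IEEE-754). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* What a pure bit-test would see: any bit set outside the sign bit. */
int has_non_sign_bits(float f)
{
    return (float_bits(f) & 0x7FFFFFFFu) != 0;
}

/* What the question actually asks for. Comparisons with NaN are false. */
int is_gt_zero(float f)
{
    return f > 0.0f;
}
```

For `NAN` from `<math.h>`, `has_non_sign_bits` gives 1 while `is_gt_zero` gives 0, so a bit-test would misclassify NaN as positive.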