SIMD instructions for floating point equality comparison (with NaN == NaN)

Which instructions would be used for comparing two 128-bit vectors consisting of 4 * 32-bit floating point values?

Is there an instruction that considers a NaN value on both sides as equal? If not, how big would the performance impact of a workaround that provides reflexivity (i.e. NaN equals NaN) be?

I heard that ensuring reflexivity would have a significant performance impact compared with IEEE semantics, where NaN doesn't equal itself, and I'm wondering how big that impact would be.

I know that you typically want to use epsilon comparisons instead of exact equality when dealing with floating-point values. But this question is about exact equality comparisons, which you could for example use to eliminate duplicate values from a hash-set.

Requirements

  • +0 and -0 must compare as equal.
  • NaN must compare equal with itself.
  • Different representations of NaN should be equal, but that requirement might be sacrificed if the performance impact is too big.
  • The result should be a boolean: true if all four float elements are the same in both vectors, and false if at least one element differs. true is represented by the scalar integer 1 and false by 0.

Test cases

(NaN, 0, 0, 0) == (NaN, 0, 0, 0) // for all representations of NaN
(-0,  0, 0, 0) == (+0,  0, 0, 0) // equal despite different bitwise representations
(1,   0, 0, 0) == (1,   0, 0, 0)
(0,   0, 0, 0) != (1,   0, 0, 0) // at least one different element => not equal 
(1,   0, 0, 0) != (0,   0, 0, 0)
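
To pin down the semantics these test cases ask for, here is a minimal scalar sketch in C. The function name reflexive_equal4_ref is made up for illustration, and it takes the strict reading of the requirements: any two NaN encodings count as equal. It is a reference to check a SIMD version against, not a fast implementation.

```c
#include <math.h>      /* isnan */
#include <stdbool.h>

/* Hypothetical reference helper (name invented for this sketch).
   Returns true if every element pair is IEEE-equal (so +0 == -0)
   or both elements are NaN, whatever their encoding. */
bool reflexive_equal4_ref(const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++) {
        bool both_nan = isnan(a[i]) && isnan(b[i]);
        if (!(a[i] == b[i]) && !both_nan)
            return false;          /* at least one element differs */
    }
    return true;                   /* all four elements matched */
}
```

Each row of the test cases above maps to one call with two four-element arrays.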

My idea for implementing this

I think it might be possible to combine two NotLessThan comparisons (CMPNLTPS?) using and to achieve the desired result. The assembler equivalent of AllTrue(!(x < y) and !(y < x)) or AllFalse((x < y) or (y < x)).

Background

The background for this question is Microsoft's plan to add a Vector type to .NET. I'm arguing for a reflexive .Equals method and need a clearer picture of how big the performance impact of this reflexive equals over an IEEE equals would be. See "Should Vector<float>.Equals be reflexive or should it follow IEEE 754 semantics?" on programmers.se for the long story.

Even AVX VCMPPS (with its greatly enhanced choice of predicates) doesn't give us a single-instruction predicate for this. You have to do at least two compares and combine the results. It's not too bad, though.

  • different NaN encodings aren't equal: effectively 2 extra insns (adding 2 uops). Without AVX: one extra movaps beyond that.

  • different NaN encodings are equal: effectively 4 extra insns (adding 4 uops). Without AVX: two extra movaps insns.

An IEEE compare-and-branch is 3 uops: cmpeqps / movmskps / test-and-branch. Intel and AMD both macro-fuse the test-and-branch into a single uop/m-op.

With AVX512: bitwise-NaN is probably just one extra instruction, since normal vector compare and branch probably uses vcmpEQ_OQps / ktest same,same / jcc, so combining two different mask regs is free (just change the args to ktest). The only cost is the extra vpcmpeqd k2, xmm0,xmm1.

AVX512 any-NaN is just two extra instructions (2x VFPCLASSPS, with the 2nd one using the result of the first as a zeromask; see below). Again, then ktest with two different args to set flags.


My best idea so far: ieee_equal || bitwise_equal

If we give up on considering different NaN encodings equal to each other:

  • Bitwise equal catches two identical NaNs.
  • IEEE equal catches the +0 == -0 case.

There are no cases where either compare gives a false positive (since ieee_equal is false when either operand is NaN: we want just equal, not equal-or-unordered. AVX vcmpps provides both options, while SSE only provides a plain equal operation.)

We want to know when all elements are equal, so we should start with inverted comparisons. It's easier to check for at least one non-zero element than to check for all elements being non-zero. (i.e. horizontal AND is hard, horizontal OR is easy (pmovmskb / test, or ptest). Taking the opposite sense of a comparison is free (jnz instead of jz).) This is the same trick that Paul R used.

; inputs in xmm0, xmm1
movaps    xmm2, xmm0    ; unneeded with 3-operand AVX instructions

cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.  -1:not ieee_equal.  predicate=NEQ_UQ in VEX encoding expanded notation
pcmpeqd   xmm0, xmm1    ; -1:bitwise equal  0:otherwise

; xmm0   xmm2
;   0      0   -> equal   (ieee_equal only)
;   0     -1   -> unequal (neither)
;  -1      0   -> equal   (bitwise equal and ieee_equal)
;  -1     -1   -> equal   (bitwise equal only: only happens when both are NaN)

andnps    xmm0, xmm2    ; NOT(xmm0) AND xmm2
; xmm0 elements are -1 where  (not bitwise equal) AND (not IEEE equal).
; xmm0 all-zero iff every element was bitwise or IEEE equal, or both
movmskps  eax, xmm0
test      eax, eax      ; it's too bad movmsk doesn't set EFLAGS according to the result
jz no_differences

For double-precision, the ...PD instructions and pcmpeqq (SSE4.1) will work the same way.

If the not-equal code goes on to find out which element isn't equal, a bit-scan on the movmskps result will give you the position of the first difference.
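
For readers who prefer intrinsics, the same sequence can be sketched in C with SSE2. The names reflexive_diff_mask and first_difference are invented for this sketch, and a portable loop stands in for the bit-scan instruction; a compiler may or may not emit exactly the instructions named in the comments.

```c
#include <emmintrin.h>   /* SSE2: _mm_cmpeq_epi32 and the cast intrinsics */

/* Hypothetical helper: 4-bit mask with bit i set where element i is
   neither IEEE-equal nor bitwise-equal. A result of 0 means "no differences". */
int reflexive_diff_mask(__m128 a, __m128 b)
{
    __m128  not_ieee = _mm_cmpneq_ps(a, b);                  /* cmpneqps: -1 where unequal or unordered */
    __m128i bit_eq   = _mm_cmpeq_epi32(_mm_castps_si128(a),
                                       _mm_castps_si128(b)); /* pcmpeqd: -1 where bitwise equal */
    /* andnps: NOT(bit_eq) AND not_ieee */
    __m128  diff     = _mm_andnot_ps(_mm_castsi128_ps(bit_eq), not_ieee);
    return _mm_movemask_ps(diff);                            /* movmskps: one sign bit per element */
}

/* Index of the first differing element, or -1 if there is none.
   The loop plays the role of a bit-scan on the movmskps result. */
int first_difference(__m128 a, __m128 b)
{
    int mask = reflexive_diff_mask(a, b);
    for (int i = 0; i < 4; i++)
        if (mask & (1 << i))
            return i;
    return -1;
}
```

In this variant two NaNs with different payloads still count as a difference, which matches the "different NaN encodings aren't equal" trade-off above.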

With SSE4.1 PTEST you can replace andnps / movmskps / test-and-branch with:

ptest    xmm0, xmm2   ; CF =  0 == (NOT(xmm0) AND xmm2).
jc no_differences

I expect this is the first time most people have ever seen the CF result of PTEST be useful for anything. :)

It's still three uops on Intel and AMD CPUs ((2 for ptest + 1 for jcc) vs (pandn + movmsk + fused test&branch)), but fewer instructions. It is more efficient if you're going to setcc or cmovcc instead of jcc, since those can't macro-fuse with test.

That makes a total of 6 uops (5 with AVX) for a reflexive compare-and-branch, vs. 3 uops for an IEEE compare-and-branch (cmpeqps / movmskps / test-and-branch).

PTEST has a very high latency on AMD Bulldozer-family CPUs (14c on Steamroller). They have one cluster of vector execution units shared by two integer cores. (This is their alternative to hyperthreading.) This increases the time until a branch mispredict can be detected, or the latency of a data-dependency chain (cmovcc / setcc).

PTEST sets ZF when 0==(xmm0 AND xmm2): set if no elements were both bitwise_equal AND IEEE (neq or unordered). i.e. ZF is unset if any element was bitwise_equal while also being !ieee_equal. This can only happen when a pair of elements contain bitwise-equal NaNs (but can happen when other elements are unequal).

    movaps    xmm2, xmm0
    cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.
    pcmpeqd   xmm0, xmm1    ; -1:bitwise equal

    ptest    xmm0, xmm2
    jc   equal_reflexive   ; other cases

...

equal_reflexive:
    setnz  dl               ; set if at least one both-nan element

There's no condition that tests CF=1 AND anything about ZF. ja tests CF=0 and ZF=0. It's unlikely that you'd only want to test that anyway, so putting a jnz in the jc branch target works fine. (And if you did only want to test equal_reflexive AND at_least_one_nan, a different setup could probably set flags appropriately.)


Considering all NaNs equal, even when not bitwise equal:

This is the same idea as Paul R's answer, but with a bugfix (combine NaN check with IEEE check using AND rather than OR).

; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
movaps      xmm3, xmm1
cmpordps    xmm3, xmm3      ; find NaNs in B
orps        xmm2, xmm3      ; 0:A and B are both NaN.  -1:anything else

cmpneqps    xmm0, xmm1      ; 0:IEEE equal (and ordered).  -1:unequal or unordered
; xmm0 AND xmm2  is zero where elements are IEEE equal, or both NaN
; xmm0   xmm2 
;   0      0     -> equal   (ieee_equal and both NaN (impossible))
;   0     -1     -> equal   (ieee_equal)
;  -1      0     -> equal   (both NaN)
;  -1     -1     -> unequal (neither equality condition)

ptest    xmm0, xmm2        ; ZF=  0 == (xmm0 AND xmm2).  Set if no differences in any element
jz   equal_reflexive
; else at least one element was unequal

;     alternative to PTEST:  andps  xmm0, xmm2 / movmskps / test / jz

So in this case we don't need PTEST's CF result after all. We do when using PCMPEQD, because it doesn't have an inverse (the way cmpunordps has cmpordps).
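
The any-NaN sequence translates fairly directly to intrinsics. The sketch below uses the andps / movmskps alternative rather than PTEST so that it only needs baseline SSE/SSE2; the function name reflexive_equal_anynan is made up for illustration.

```c
#include <emmintrin.h>   /* SSE2 header; also pulls in the SSE compares used here */

/* Hypothetical helper: returns 1 if every element pair is IEEE-equal
   or both elements are NaN (any encoding), else 0. */
int reflexive_equal_anynan(__m128 a, __m128 b)
{
    __m128 a_ord        = _mm_cmpord_ps(a, a);      /* cmpordps: 0 where a is NaN, -1 elsewhere */
    __m128 b_ord        = _mm_cmpord_ps(b, b);      /* cmpordps: 0 where b is NaN, -1 elsewhere */
    __m128 not_both_nan = _mm_or_ps(a_ord, b_ord);  /* orps: 0 only where both are NaN */
    __m128 not_ieee     = _mm_cmpneq_ps(a, b);      /* cmpneqps: -1 where unequal or unordered */
    __m128 diff         = _mm_and_ps(not_ieee, not_both_nan);  /* -1 where truly different */
    return _mm_movemask_ps(diff) == 0;              /* horizontal OR: any set bit is a difference */
}
```

Because the sense is inverted, the final check is "no bit set", which is the cheap direction.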

9 fused-domain uops for Intel SnB-family CPUs. (7 with AVX: use non-destructive 3-operand instructions to avoid the movaps.) However, pre-Skylake SnB-family CPUs can only run cmpps on p1, so this bottlenecks on the FP-add unit if throughput is a concern. Skylake runs cmpps on p0/p1.

andps has a shorter encoding than pand, and Intel CPUs from Nehalem to Broadwell can only run it on port5. That may be desirable to prevent it from stealing a p0 or p1 cycle from surrounding FP code. Otherwise pandn is probably a better choice. On AMD BD-family, andnps runs in the ivec domain anyway, so you don't avoid the bypass delay between int and FP vectors (which you might otherwise expect to manage if you use movmskps instead of ptest, in this version that only uses cmpps, not pcmpeqd). Also note that instruction ordering is chosen for human readability here. Putting the FP compare(A,B) earlier, before the ANDPS, might help the CPU get started on that a cycle sooner.

If one operand is reused, it should be possible to reuse its self-NaN-finding result. The new operand still needs its self-NaN check, and a compare against the reused operand, so we only save one movaps / cmpps.

If the vectors are in memory, at least one of them needs to be loaded with a separate load insn. The other one can just be referenced twice from memory. This sucks if it's unaligned or the addressing mode can't micro-fuse, but could be useful. If one of the operands to vcmpps is a vector known to not have any NaNs (e.g. a zeroed register), vcmpunord_qps xmm2, xmm15, [rsi] will find NaNs in [rsi].

If we don't want to use PTEST, we can get the same result by using the opposite comparisons, but combining them with the opposite logical operator (AND vs. OR).

; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpunordps  xmm2, xmm2      ; find NaNs in A (-1:NaN  0:anything else)
movaps      xmm3, xmm1
cmpunordps  xmm3, xmm3      ; find NaNs in B
andps       xmm2, xmm3      ; xmm2 = (-1:both NaN  0:anything else)
; now in the same boat as before: xmm2 is set for elements we want to consider equal, even though they're not IEEE equal

cmpeqps     xmm0, xmm1      ; -1:ieee_equal  0:unordered or unequal
; xmm0   xmm2 
;  -1      0     -> equal   (ieee_equal)
;  -1     -1     -> equal   (ieee_equal and both NaN (impossible))
;   0      0     -> unequal (neither)
;   0     -1     -> equal   (both NaN)

orps        xmm0, xmm2      ; 0: unequal.  -1:reflexive_equal
movmskps    eax, xmm0
cmp         eax, 0xF        ; all four mask bits must be set (test/jnz would only check for at least one equal element)
je   equal_reflexive
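
A sketch of this non-inverted variant in C intrinsics, assuming the goal is still "all four elements equal". The function name reflexive_equal_anynan_pos is hypothetical. With the positive sense, each mask bit means "this element is equal", so the final step is a horizontal AND: the whole movmskps result is compared against 0xF.

```c
#include <emmintrin.h>   /* SSE2 header; also pulls in the SSE compares used here */

/* Hypothetical helper: returns 1 only if all four element pairs are
   IEEE-equal or both NaN (any encoding). */
int reflexive_equal_anynan_pos(__m128 a, __m128 b)
{
    __m128 a_nan    = _mm_cmpunord_ps(a, a);       /* cmpunordps: -1 where a is NaN */
    __m128 b_nan    = _mm_cmpunord_ps(b, b);       /* cmpunordps: -1 where b is NaN */
    __m128 both_nan = _mm_and_ps(a_nan, b_nan);    /* andps: -1 where both are NaN */
    __m128 ieee_eq  = _mm_cmpeq_ps(a, b);          /* cmpeqps: -1 where ordered and equal */
    __m128 equal    = _mm_or_ps(ieee_eq, both_nan);/* orps: -1 where reflexive-equal */
    return _mm_movemask_ps(equal) == 0xF;          /* every element must be equal */
}
```

A vector pair that matches in three elements and differs in one yields a mask such as 0xE, which is non-zero but not 0xF, so a plain non-zero test would not be enough here.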

Other ideas: unfinished, non-viable, broken, or worse-than-the-above

The all-ones result of a true comparison is an encoding of NaN. (Try it out.) Perhaps we can avoid using POR or PAND to combine results from cmpps on each operand separately?

; inputs in A:xmm0 B:xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
; cmpunordps wouldn't be useful: NaN stays NaN, while other values are zeroed.  (This could be useful if ORPS didn't exist)

; integer -1 (all-ones) is a NaN encoding, but all-zeros is 0.0
cmpunordps  xmm2, xmm1
; A:NaN B:0   ->  0   unord 0   -> false
; A:0   B:NaN ->  NaN unord NaN -> true

; A:0   B:0   ->  NaN unord 0   -> true
; A:NaN B:NaN ->  0   unord NaN -> true

; Desired:   0 where A and B are both NaN.

cmpordps xmm2, xmm1 just flips the final result for each case, with the "odd-man-out" still on the 1st row.

We can only get the result we want (true iff A and B are both NaN) if both inputs are inverted (NaN -> non-NaN and vice versa). This means we could use this idea for cmpordps as a replacement for pand after doing cmpordps self,self on both A and B. This isn't useful: even if we have AVX but not AVX2, we can use vandps and vandnps (followed by vmovmskps or vptest, both of which are available with AVX and don't need AVX2). Bitwise booleans are only single-cycle latency, and don't tie up the vector-FP-add execution port(s) which is already a bottleneck for this code.


VFIXUPIMMPS

I spent a while with the manual grokking its operation.

It can modify a destination element if a source element is NaN, but that can't be conditional on anything about the dest element.

I was hoping I could think of a way to vcmpneqps and then fixup that result, once with each source operand (to elide the boolean instructions that combine the results of 3 vcmpps instructions). I'm now fairly sure that's impossible, because knowing that one operand is NaN isn't enough by itself to make a change to the IEEE_equal(A,B) result.

I think the only way we could use vfixupimmps is for detecting NaNs in each source operand separately, like vcmpunord_qps but worse. Or as a really stupid replacement for andps, detecting either 0 or all-ones (NaN) in the mask results of previous compares.


AVX512 mask registers

Using AVX512 mask registers could help combine the results of compares. Most AVX512 compare instructions put the result into a mask register instead of a mask vector in a vector reg, so we actually have to do things this way if we want to operate in 512b chunks.

VFPCLASSPS k2 {k1}, xmm2, imm8 writes to a mask register, optionally masked by a different mask register. By setting only the QNaN and SNaN bits of the imm8, we can get a mask of where there are NaNs in a vector. By setting all the other bits, we can get the inverse.

By using the mask from A as a zero-mask for the vfpclassps on B, we can find the both-NaN positions with only 2 instructions, instead of the usual cmp/cmp/combine. So we save an or or andn instruction. Incidentally, I wonder why there's no OR-NOT operation. Probably it comes up even less often than AND-NOT, or they just didn't want porn in the instruction set.

Neither yasm nor nasm can assemble this, so I'm not even sure if I have the syntax correct!

; I think this works

;  0x81 = CLASS_QNAN|CLASS_SNAN (first and last bits of the imm8)
VFPCLASSPS    k1,     zmm0, 0x81 ; k1 = 1:NaN in A.   0:non-NaN
VFPCLASSPS    k2{k1}, zmm1, 0x81 ; k2 = 1:NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
;; so k2 is like the bitwise-equal result from pcmpeqd: it's an override for ieee_equal

vcmpNEQ_UQps  k3, zmm0, zmm1
;; k3= 0 only where IEEE equal (because of cmpneqps normal operation)

;  k2   k3   ; same logic table as the pcmpeqd bitwise-NaN version
;  0    0    ->  equal   (ieee equal)
;  0    1    ->  unequal (neither)
;  1    0    ->  equal   (ieee equal and both-NaN (impossible))
;  1    1    ->  equal   (both NaN)

;  not(k2) AND k3 is true only when the element is unequal (bitwise and ieee)

KTESTW        k2, k3    ; same as PTEST: set CF from 0 == (NOT(k2) AND k3)
jc .reflexive_equal

We could reuse the same mask register as both zeromask and destination for the 2nd vfpclassps insn, but I used different registers in case I wanted to distinguish between them in a comment. This code needs a minimum of two mask registers, but no extra vector registers. We could also use k0 instead of k3 as the destination for vcmpps, since we don't need to use it as a predicate, only as a dest and src. (k0 is the register that can't be used as a predicate, because that encoding instead means "no masking".)

I'm not sure we could create a single mask with the reflexive_equal result for each element, without a k... instruction to combine two masks at some point (e.g. kandnw instead of ktestw). Masks only work as zero-masks, not one-masks that can force a result to one, so combining the vfpclassps results only works as an AND. So I think we're stuck with 1-means-both-NaN, which is the wrong sense for using it as a zeromask with vcmpps. Doing vcmpps first, and then using the mask register as destination and predicate for vfpclassps, doesn't help either. Merge-masking instead of zero-masking would do the trick, but isn't available when writing to a mask register.

;;; Demonstrate that it's hard (probably impossible) to avoid using any k... instructions
vcmpneq_uqps  k1,    zmm0, zmm1   ; 0:ieee equal   1:unequal or unordered

vfpclassps    k2{k1}, zmm0, 0x81   ; 0:ieee equal or A is NaN.  1:unequal
vfpclassps    k2{k2}, zmm1, 0x81   ; 0:ieee equal | A is NaN | B is NaN.  1:unequal
;; This is just a slow way to do vcmpneq_Oqps: ordered and unequal.

vfpclassps    k3{k1}, zmm0, ~0x81  ; 0:ieee equal or A is not NaN.  1:unequal and A is NaN
vfpclassps    k3{k3}, zmm1, ~0x81  ; 0:ieee equal | A is not NaN | B is not NaN.  1:unequal & A is NaN & B is NaN
;; nope, mixes the conditions the wrong way.
;; The bits that remain set don't have any information from vcmpneqps left: both-NaN is always ieee-unequal.

If ktest ends up being 2 uops like ptest, and can't macro-fuse, then kmov eax, k2 / test-and-branch will probably be cheaper than ktest k1,k2 / jcc. Hopefully it will only be one uop, since mask registers are more like integer registers, and can be designed from the start to be internally "close" to the flags. ptest was only added in SSE4.1, after many generations of designs with no interaction between vectors and EFLAGS.

kmov does set you up for popcnt, bsf or bsr, though. (bsf / jcc doesn't macro-fuse, so in a search loop you're probably still going to want to test/jcc and only bsf when a non-zero is found. The extra byte to encode tzcnt doesn't buy you anything unless you're doing something branchless, because bsf still sets ZF on a zero input, even though the dest register is undefined. lzcnt gives 31 - bsr for a non-zero 32-bit input, though, so it can be useful even when you know the input is non-zero.)

We can also use vcmpEQps and combine our results differently:

VFPCLASSPS      k1,     zmm0, 0x81 ; k1 = set where there are NaNs in A
VFPCLASSPS      k2{k1}, zmm1, 0x81 ; k2 = set where there are NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
vcmpEQ_OQps     k3, zmm0, zmm1
;; k3= 1 only where IEEE equal and ordered (cmpeqps normal operation)

;  k3   k2
;  1    0    ->  equal   (ieee equal)
;  1    1    ->  equal   (ieee equal and both-NaN (impossible))
;  0    0    ->  unequal (neither)
;  0    1    ->  equal   (both NaN)

KORTESTW        k3, k2  ; CF = set iff k3|k2 is all-ones.
jc .reflexive_equal

This way only works when there's a size of kortest that exactly matches the number of elements in our vectors. e.g. a 256b vector of double-precision elements only has 4 elements, but kortestb still sets CF according to the low 8 bits of the input mask registers.


Using only integer ops

Other than NaN, +/-0 is the only time when IEEE_equal is different from bitwise_equal. (Unless I'm missing something. Double-check this assumption before using!) +0 and -0 have all their bits zero, except that -0 has the sign bit set (the MSB).

If we ignore different NaN encodings, then bitwise_equal is the result we want, except in the +/- 0 case. A OR B will be 0 everywhere except the sign bit iff A and B are +/- 0. A left-shift by one makes it all-zero or not-all-zero depending on whether or not we need to override the bitwise-equal test.

This uses one more instruction than cmpneqps, because we're emulating the functionality we need from it with por / paddD. (or pslld by one, but that's one byte longer. It does run on a different port than pcmpeq, but you need to consider the port distribution of the surrounding code to factor that into the decision.)

This algorithm might be useful on different SIMD architectures that don't provide the same vector FP tests for detecting NaN.

;inputs in xmm0:A  xmm1:B
movaps    xmm2, xmm0
pcmpeqd   xmm2, xmm1     ; xmm2=bitwise_equal.  (0:unequal -1:equal)

por       xmm0, xmm1
paddD     xmm0, xmm0     ; left-shift by 1 (one byte shorter than pslld xmm0, 1, and can run on more ports).

; xmm0=all-zero only in the +/- 0 case (where A and B are IEEE equal)

; xmm2     xmm0          desired result (0 means "no difference found")
;  -1       0        ->      0          ; bitwise equal and +/-0 equal
;  -1     non-zero   ->      0          ; just bitwise equal
;   0       0        ->      0          ; just +/-0 equal
;   0     non-zero   ->      non-zero   ; neither

ptest     xmm2, xmm0         ; CF = ( (not(xmm2) AND xmm0) == 0)
jc  reflexive_equal

The latency is lower than the cmpneqps version above, by one or two cycles.

We're really taking full advantage of PTEST here: using its ANDN between two different operands, and using its compare-against-zero of the whole thing. We can't replace it with pandn / movmskps because we need to check all the bits, not just the sign bit of each element.

I haven't actually tested this, so it might be wrong even if my conclusion that +/-0 is the only time IEEE_equal is different from bitwise_equal (other than NaNs) is correct.
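
Here is a rough C sketch of the integer-only idea that can be checked against the test cases. It is restricted to SSE2, so instead of PTEST it uses pandn followed by a compare against zero and pmovmskb to look at every bit. That substitution is an assumption of this sketch rather than part of the asm above, and the name reflexive_equal_intonly is invented.

```c
#include <emmintrin.h>   /* SSE2 integer intrinsics */

/* Hypothetical helper: returns 1 if every element pair is bitwise equal
   or is a +0 / -0 pair. Different NaN encodings are treated as different. */
int reflexive_equal_intonly(__m128 a, __m128 b)
{
    __m128i ai      = _mm_castps_si128(a);
    __m128i bi      = _mm_castps_si128(b);
    __m128i bit_eq  = _mm_cmpeq_epi32(ai, bi);         /* pcmpeqd: -1 where bitwise equal */
    __m128i or_ab   = _mm_or_si128(ai, bi);            /* por */
    __m128i shifted = _mm_add_epi32(or_ab, or_ab);     /* paddd: left shift by 1, drops the sign bit */
    /* shifted is zero only where both inputs are +/-0 */
    __m128i diff    = _mm_andnot_si128(bit_eq, shifted); /* non-zero where neither test passed */
    /* SSE2 stand-in for PTEST: every byte of diff must be zero */
    __m128i is_zero = _mm_cmpeq_epi32(diff, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) == 0xFFFF;
}
```

The extra compare and pmovmskb cost more than PTEST would, so this is meant for checking the logic, not as a tuned replacement.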


Handling non-bitwise-identical NaNs with integer-only ops is probably not worth it. The encoding is so similar to +/-Inf that I can't think of any simple checks that wouldn't take several instructions. Inf has all the exponent bits set, and an all-zero mantissa. NaN has all the exponent bits set, with a non-zero mantissa aka significand (so there are 23 bits of payload). The MSB of the mantissa is interpreted as an is_quiet flag to distinguish signalling / quiet NaNs. Also see Intel manual vol1, table 4-3 (Floating-Point Number and NaN Encodings).

If it wasn't for -Inf using the top-9-bits-set encoding, we could check for NaN with an unsigned compare for A > 0x7f800000. (0x7f800000 is single-precision +Inf.) However, note that pcmpgtd / pcmpgtq are signed integer compares. AVX512F VPCMPUD is an unsigned compare (dest = a mask register).
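
To make the encoding argument concrete, here is a small scalar sketch of an integer-only NaN test on a raw binary32 bit pattern. The helper name is_nan_bits is made up. It clears the sign bit first, which is one extra operation but removes the -Inf problem. Once the sign bit is cleared both sides of the compare are non-negative, so a signed compare gives the same answer as an unsigned one.

```c
#include <stdint.h>

/* Hypothetical helper: classify a binary32 bit pattern as NaN using
   integer operations only. */
int is_nan_bits(uint32_t bits)
{
    uint32_t magnitude = bits & 0x7fffffffu;  /* drop the sign bit so -Inf looks like +Inf */
    return magnitude > 0x7f800000u;           /* exponent all-ones AND non-zero mantissa */
}
```

For example, 0x7f800000 (+Inf) and 0xff800000 (-Inf) are rejected, while 0x7f800001 (a signalling NaN) and 0xffc00000 (a negative quiet NaN) are accepted.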


The OP's idea: !(a<b) && !(b<a)

The OP's suggestion of !(a<b) && !(b<a) can't work, and neither can any variation of it. You can't tell the difference between one NaN and two NaNs just from two compares with reversed operands. Even mixing predicates can't help: no VCMPPS predicate differentiates one operand being NaN from both operands being NaN, or depends on whether it's the first or second operand that's NaN. Thus, it's impossible for a combination of them to have that information.

Paul R's solution of comparing a vector with itself does let us detect where there are NaNs and handle them "manually". No combination of results from VCMPPS between the two operands is sufficient, but using operands other than A and B does help. (Either a known-non-NaN vector or same operand twice.)
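
A scalar sketch makes the failure easy to see. The function names below are invented for illustration. The predicate returns true whenever either operand is NaN, so a single NaN is indistinguishable from a pair of NaNs.

```c
#include <math.h>   /* NAN */

/* The OP's predicate applied to one element pair (hypothetical name). */
int not_less_either_way(float a, float b)
{
    return !(a < b) && !(b < a);   /* both ordered compares are false if either input is NaN */
}

/* Returns 1: the predicate gives the same answer for (NaN, 1.0)
   as it does for (NaN, NaN), so it cannot tell them apart. */
int one_nan_looks_like_two(void)
{
    return not_less_either_way(NAN, 1.0f) == not_less_either_way(NAN, NAN);
}
```

So the predicate would report (NaN, 0, 0, 0) and (1, 0, 0, 0) as equal, which the requirements don't allow.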


Without the inversion, the bitwise-NaN code finds when at least one element is equal. (There's no inverse for pcmpeqd, so we can't use different logical operators and still get a test for all-equal):

; inputs in xmm0, xmm1
movaps   xmm2, xmm0
cmpeqps  xmm2, xmm1    ; -1:ieee_equal.  EQ_OQ predicate in the expanded notation for VEX encoding
pcmpeqd  xmm0, xmm1    ; -1:bitwise equal
orps     xmm0, xmm2
; xmm0 = -1:(where an element is bitwise or ieee equal)   0:elsewhere

movmskps eax, xmm0
test     eax, eax
jnz at_least_one_equal
; else  all different

PTEST isn't useful this way, since combining with OR is the only useful thing.


// UNFINISHED start of an idea
bitdiff = _mm_xor_si128(A, B);
signbitdiff = _mm_srai_epi32(bitdiff, 31);   // broadcast the diff in sign bit to the whole vector
signbitdiff = _mm_srli_epi32(bitdiff, 1);    // zero the sign bit
something = _mm_and_si128(bitdiff, signbitdiff);

Here is one possible solution - it's not very efficient however, requiring 6 instructions:

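#include <stdbool.h>
#include <smmintrin.h>                                 // SSE4.1, needed for _mm_testz_si128

__m128 v0, v1; // float vectors

__m128 v0nan = _mm_cmpeq_ps(v0, v0);                   // test v0 for NaNs (0 where NaN, all-ones elsewhere)
__m128 v1nan = _mm_cmpeq_ps(v1, v1);                   // test v1 for NaNs (0 where NaN, all-ones elsewhere)
__m128 vnan = _mm_or_ps(v0nan, v1nan);                 // combine: 0 only where both are NaN
__m128 vcmp = _mm_cmpneq_ps(v0, v1);                   // compare floats: all-ones where unequal or unordered
vcmp = _mm_and_ps(vcmp, vnan);                         // combine NaN test
__m128i vbits = _mm_castps_si128(vcmp);                // PTEST takes integer vectors
bool cmp = _mm_testz_si128(vbits, vbits);              // return true if all equal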

Note that all the logic above is inverted, which may make the code a little hard to follow (ORs are effectively ANDs, and vice versa).
