
What is the instruction that gives branchless FP min and max on x86?

To quote (thanks to the author for developing and sharing the algorithm!):

https://tavianator.com/fast-branchless-raybounding-box-intersections/

Since modern floating-point instruction sets can compute min and max without branches

Corresponding code by the author is just

double dmnsn_min(double a, double b)
{
  return a < b ? a : b;
}

I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above obviously is meant to be used in a scalar form.

Question:

  • What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
  • Is it safe to assume it's going to be applied, or how do I call it?
  • Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray-box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
  • Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable wrt the (unknown) instruction we're discussing and the floating-point standard?

Just in case: I'm familiar with Use of min and max functions in C++; I believe it's related but not quite my question.

Warning: Beware of compilers treating _mm_min_ps / _mm_max_ps (and _pd) intrinsics as commutative even in strict FP (not fast-math) mode, even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867, which was fixed in GCC7 but may be back or never fixed for _mm_min_ss etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).

GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.

Unfortunately there isn't a single instruction that implements fmin(a,b) (with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.


Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.

MINSS a,b exactly implements (a<b) ? a : b according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities (i.e. it keeps the source operand, b, on unordered). This means C++ compilers can use them for std::min(b,a) and std::max(b,a), because those functions are based on the same expression. Note the b,a operand order for the std:: functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.

MAXSS a,b exactly implements (b<a) ? a : b, again keeping the source operand (b) on unordered. Like std::max(b,a).

Looping over an array with x = std::min(arr[i], x); (i.e. minss or maxss xmm0, [rsi]) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN; outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0 to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.
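A minimal sketch of that pattern (the function name is mine, not from the answer): start the running minimum at NaN, let the first unordered compare replace it with the first element, and no FLT_MAX / +Infinity sentinel is needed. If the array itself contains a NaN, you get the min of the elements after the last one, as described above.

#include <algorithm>
#include <cmath>
#include <cstddef>

float array_min(const float *arr, std::size_t n) {
    float v = NAN;                    // unordered against everything, so arr[0] wins the first compare
    for (std::size_t i = 0; i < n; ++i)
        v = std::min(arr[i], v);      // minss xmm0, [mem]: the memory operand wins on unordered
    return v;                         // still NaN if n == 0
}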

Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]); (or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.

(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do a horizontal min of it, just like summing an array or doing a dot product.)
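One possible shape for that, sketched with SSE intrinsics (the function name, the unroll factor, and the assumption that n is a multiple of 8 and at least 8 are mine):

#include <immintrin.h>
#include <cstddef>

float array_min_vec(const float *arr, std::size_t n) {
    __m128 acc0 = _mm_loadu_ps(arr);        // two accumulators hide minps latency
    __m128 acc1 = _mm_loadu_ps(arr + 4);
    for (std::size_t i = 8; i < n; i += 8) {
        acc0 = _mm_min_ps(acc0, _mm_loadu_ps(arr + i));
        acc1 = _mm_min_ps(acc1, _mm_loadu_ps(arr + i + 4));
    }
    __m128 acc = _mm_min_ps(acc0, acc1);                  // reduce 2 accumulators to 1 vector
    acc = _mm_min_ps(acc, _mm_movehl_ps(acc, acc));       // horizontal min: high half vs. low half
    acc = _mm_min_ss(acc, _mm_shuffle_ps(acc, acc, 1));   // element 1 vs. element 0
    return _mm_cvtss_f32(acc);
}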


Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/

But fortunately you don't need to do this manually; compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal, as shown below (Godbolt link):

// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
  return a < b ? a : b;
}

Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.

(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)

The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.

I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.

Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss as the intrinsic for MINSS.

When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)

Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.
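One way that alternative could look, selecting with an explicit compare plus AND/ANDNOT/OR so the result doesn't depend on how the compiler treats _mm_min_ps operand order (a sketch; the function name is mine):

#include <immintrin.h>

// Per-lane (a < b) ? a : b. Like MINPS, b is taken when the compare is unordered,
// because cmplt produces all-zeros for NaN lanes.
static inline __m128 min_select(__m128 a, __m128 b) {
    __m128 a_lt_b = _mm_cmplt_ps(a, b);           // all-ones where a < b, else all-zeros
    return _mm_or_ps(_mm_and_ps(a_lt_b, a),       // take a where a < b
                     _mm_andnot_ps(a_lt_b, b));   // otherwise take b (including NaN cases)
}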

Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.


As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.

The commutative-_mm_min_ps issue applies only to the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).

You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.

(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)


In C++, use std::min and std::max, which are defined in terms of > or <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax for performance unless you need their NaN behaviour.

In C, I think just write your own min and max functions (or macros if you do it safely).
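For example, something like this (a sketch; the names are arbitrary). Inline functions avoid the double-evaluation hazard of naive macros, and the ternary form is exactly what compilers turn into MINSS / MAXSS:

static inline float min_float(float a, float b) { return (a < b) ? a : b; }  /* MINSS semantics */
static inline float max_float(float a, float b) { return (a > b) ? a : b; }  /* MAXSS semantics */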


C & asm on the Godbolt compiler explorer

float minfloat(float a, float b) {
  return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
    minss   xmm0, xmm1
    ret

// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
  # This implementation of std::min uses (b<a) ? b : a;
  # So it can produce the result only in the register that b was in
  # This isn't worse (when inlined), just opposite
    minss   xmm1, xmm0
    movaps  xmm0, xmm1
    ret


float minfloat_fmin(float a, float b) { return fminf(a, b); }

# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
    movaps  xmm2, xmm0
    cmpunordss      xmm2, xmm2
    movaps  xmm3, xmm2
    andps   xmm3, xmm1
    minss   xmm1, xmm0
    andnps  xmm2, xmm1
    orps    xmm2, xmm3
    movaps  xmm0, xmm2
    ret
   # Obviously you don't want this if you don't need it.

If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.

If you don't expect NaNs, or want to handle them specially, write stuff like

lowest = _mm_min_ps(lowest, some_loop_variable);

so the register holding lowest can be updated in-place (even without AVX).


Taking advantage of MINPS's NaN behaviour:

Say your scalar code is something like

if(some condition)
    lowest = min(lowest, x);

Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31.)
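The PSRAD sign-bit broadcast mentioned in that parenthetical could look like this (the helper name is mine):

#include <immintrin.h>

// psrad by 31 smears the sign bit across the whole 32-bit lane, turning a
// sign-bit-only truth value into a proper all-ones / all-zeros mask.
static inline __m128 broadcast_sign_bit(__m128 v) {
    return _mm_castsi128_ps(_mm_srai_epi32(_mm_castps_si128(v), 31));
}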

The straight-forward way to implement this would be to blend x with +Inf based on the condition mask. Or do newval = min(lowest, x); and blend newval into lowest (either BLENDVPS or AND/ANDNOT/OR).
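The +Inf blend could be sketched like this with the SSE4.1 blendvps variant (the function and argument names are illustrative): lanes where the condition mask is all-zeros are replaced by +Inf, so the following minps leaves lowest unchanged there.

#include <immintrin.h>
#include <cmath>

static inline __m128 conditional_min_blend(__m128 lowest, __m128 x, __m128 condition) {
    __m128 plus_inf = _mm_set1_ps(INFINITY);
    __m128 masked_x = _mm_blendv_ps(plus_inf, x, condition); // condition ? x : +Inf, per lane
    return _mm_min_ps(lowest, masked_x);                     // +Inf lanes leave lowest unchanged
}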

But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:

__m128 inverse_condition = _mm_cmplt_ps(foo, bar); // all-ones in lanes where the update should NOT happen
__m128 x = whatever;

x = _mm_or_ps(x, inverse_condition); // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest);      // NaN elements in x mean no change in lowest
//  REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
//  AND DOESN'T WORK AT ALL WITH MOST GCC VERSIONS.

So with only SSE2, we've done a conditional MINPS in two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).

The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (it's 2 uops on most CPUs).
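That ANDPS/ANDNPS/ORPS version might look like the following (again just a sketch with made-up names): compute the min everywhere, then select per lane between the new and old value.

#include <immintrin.h>

static inline __m128 conditional_min_sse2(__m128 lowest, __m128 x, __m128 condition) {
    __m128 newval = _mm_min_ps(lowest, x);                // unconditional min
    return _mm_or_ps(_mm_and_ps(condition, newval),       // condition set  -> take the updated value
                     _mm_andnot_ps(condition, lowest));   // condition clear -> keep the old value
}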

Peter Cordes's answer is great; I just figured I'd jump in with some shorter point-by-point answers:

  • What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?

I was referring to minss / minsd. And even other architectures without such instructions should be able to do this branchlessly with conditional moves.

  • Is it safe to assume it's going to be applied, or how do I call it?

gcc and clang will both optimize (a < b) ? a : b to minss / minsd, so I don't bother using intrinsics. Can't speak to other compilers though.

  • Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray-box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?

The individual a < b tests are pretty much completely unpredictable, so it is very important to avoid branching for those. Tests like if (ray.dir.x != 0.0) are very predictable, so avoiding those branches is less important, but it does shrink the code size and make it easier to vectorize. The most important part is probably removing the divisions though.

  • Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable wrt the (unknown) instruction we're discussing and the floating-point standard?

Yes, minss / minsd behave exactly like (a < b) ? a : b, including their treatment of infinities and NaNs.
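A couple of concrete consequences of that equivalence, as a tiny sanity-check sketch reusing the dmnsn_min from the question (not code from the original answer):

#include <cassert>
#include <cmath>

static double dmnsn_min(double a, double b) { return a < b ? a : b; }

int main() {
    assert(dmnsn_min(3.0, INFINITY) == 3.0);        // +Inf loses to any finite value
    assert(dmnsn_min(-INFINITY, 3.0) == -INFINITY); // -Inf wins
    assert(dmnsn_min(NAN, 3.0) == 3.0);             // unordered compare -> second operand, like MINSD
}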

Also, I wrote a followup post to the one you referenced that talks about NaNs and min/max in more detail.
