Can floating point multiplication by zero be optimised at runtime?

I am writing an algorithm to find the inverse of an nxn matrix. Let us take the specific case of a 3x3 matrix.

When you invert a matrix by hand, you typically look for rows/columns containing one or more zeros to make the determinant calculation faster, as each zero eliminates a term you would otherwise need to calculate.

Following this logic in C/C++, if you identify a row/column with one or more zeros, you will end up with the following code:

float term1 = currentElement * DetOf2x2(...);
//           ^
//           This is equal to 0.
//
// float term2 = ... and so on.

As the compiler cannot know at compile time that currentElement will be zero, it cannot optimise this to something like float term = 0; and thus the floating point multiplication will be carried out at runtime.

My question is, will these zero values make the floating point multiplication faster, or will the multiplication take the same amount of time regardless of the value of currentElement? If there is no way of optimising the multiplication at runtime, then I can remove the logic that searches for rows/columns containing zeros.
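For concreteness, the zero-searching logic described above might look roughly like the sketch below for a 3x3 determinant expanded along the first row. The question does not show DetOf2x2's real signature, so the signature here, the DetOf3x3 wrapper, and the row-major float[3][3] layout are all assumptions made for illustration.

```c
#include <stddef.h>

/* Determinant of the 2x2 minor left after deleting row 0 and column `col`.
   Hypothetical signature: the question only shows DetOf2x2(...). */
static float DetOf2x2(const float m[3][3], size_t col)
{
    size_t c0 = (col == 0) ? 1 : 0;   /* first remaining column  */
    size_t c1 = (col == 2) ? 1 : 2;   /* second remaining column */
    return m[1][c0] * m[2][c1] - m[1][c1] * m[2][c0];
}

/* Cofactor expansion along row 0, skipping terms whose element is exactly zero. */
static float DetOf3x3(const float m[3][3])
{
    float det = 0.0f;
    for (size_t col = 0; col < 3; ++col) {
        float currentElement = m[0][col];
        if (currentElement == 0.0f)
            continue;                 /* skip the minor entirely */
        float sign = (col % 2 == 0) ? 1.0f : -1.0f;
        det += sign * currentElement * DetOf2x2(m, col);
    }
    return det;
}
```

For a matrix whose first row is {0, 2, 0}, only the middle minor is evaluated. Whether the explicit test pays for itself is what the answers below discuss.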

The compiler is not allowed to optimize this unless the calculation is trivial (e.g. all constants).

The reason is that DetOf2x2 may return a NaN floating point value. Multiplying a NaN by zero does not return zero but a NaN again.

You can try it yourself using this little test:

#include <math.h>
#include <stdio.h>

int main (int argc, char **args)
{
  // generate a NAN
  float a = sqrt (-1);

  // Multiply NAN with zero..
  float b = 0*a;

  // this should *not* output zero
  printf ("%f\n", b);
  return 0;
}

If you want to optimize your code, you have to test for zero on your own. The compiler will not do that for you.

float term1 = currentElement * DetOf2x2(...);

The compiler will call DetOf2x2(...) even if currentElement is 0: that's sure to be far more costly than the final multiplication, whether by 0 or not. There are multiple reasons for that:

  • DetOf2x2(...) may have side effects (like output to a log file) that need to happen even when currentElement is 0, and
  • DetOf2x2(...) may return values like the Not-a-Number / NaN sentinel that should propagate to term1 anyway (as noted first by Nils Pipenbrinck)

Given DetOf2x2(...) is almost certainly working on values that can only be determined at run-time, the latter possibility can't be ruled out at compile time.

If you want to avoid the call to DetOf2x2(...), try:

float term1 = (currentElement != 0) ? currentElement * DetOf2x2(...) : 0;

Modern CPUs handle a multiply by zero very quickly: at least as quickly as a general multiply, and much more quickly than a mispredicted branch. Don't even bother trying to optimize this unless that zero is going to propagate through at least several dozen instructions.

Optimisations performed at runtime are known as JIT (just-in-time) optimisations. Optimisations performed at translation (compilation) are known as AOT (ahead-of-time) optimisations. You're referring to JIT optimisations. A compiler might introduce JIT optimisations into your machine code, but that is certainly far more complex to implement than the common AOT optimisations. Optimisations are typically implemented in order of significance, and this kind of "optimisation" might be seen to affect other algorithms negatively. C implementations aren't required to perform any of these optimisations.

You could provide the optimisation manually, which would be "the logic that searches for rows/columns containing zeros", or something like this: float term1 = currentElement != 0 ? currentElement * DetOf2x2(...) : 0;

The following construct can be resolved at compile time when the compiler can deduce the value of "currentElement".

float term1 = currentElement ? currentElement * DetOf2x2(...) : 0;

If it cannot be deduced at compile time, it will be checked at run-time, and the performance depends on the processor architecture: it is a trade-off between a branch (branch latency plus the delay to refill the instruction pipeline can cost up to 10 or 20 cycles), flat code (some processors run 3 instructions per cycle), and hardware branch prediction (when the hardware supports it).

Since multiplication throughput is close to 1 cycle on an x86_64 processor, there is no performance difference depending on operand values such as 0.0, 1.0, 2.0 or 12345678.99. If such a difference existed, it would be perceived as a covert channel in cryptographic-style software.
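If you would rather measure than take this on trust, a rough timing harness along the following lines can compare a zero factor against a non-zero one on your own hardware. The function name time_multiplies and the synthetic stand-in for DetOf2x2(...) are assumptions for illustration, and clock() is coarse, so treat any numbers as indicative only.

```c
#include <stddef.h>
#include <time.h>

/* Multiplies n synthetic "determinant" values by `factor` and returns the
   elapsed CPU time in seconds. The volatile accumulator stops the compiler
   from deleting the loop; it also adds memory traffic to every iteration. */
static double time_multiplies(float factor, size_t n)
{
    volatile float sink = 0.0f;
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i) {
        float det = (float)(i % 1024) + 0.5f;  /* stand-in for DetOf2x2(...) */
        sink = sink + factor * det;
    }
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Pass the factor in from something the compiler cannot see through, such as a command-line argument, and call the function with both 0.0f and a non-zero value for a large n. Run each case several times before comparing.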

GCC lets you check at compile time whether a function argument is a known constant:

inline float myFn(float currentElement, myMatrix M)
{
    /* An ordinary if, folded away by the optimiser: a preprocessor #if
       cannot see function parameters. */
    if (__builtin_constant_p(currentElement) && currentElement == 0.0f)
        return 0.0f;
    return currentElement * det(M);
}

You need to enable inlining and interprocedural optimizations in the compiler.
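The answer leaves myMatrix and det undefined, so here is a compilable sketch of the same idea. The 2x2 struct layout and the det helper are hypothetical, chosen only to keep the example short; __builtin_constant_p itself is a GCC builtin that Clang also provides.

```c
/* Hypothetical 2x2 matrix type; the original answer does not define myMatrix. */
typedef struct { float a, b, c, d; } myMatrix;

/* Hypothetical determinant helper for the type above. */
static inline float det(myMatrix M)
{
    return M.a * M.d - M.b * M.c;
}

static inline float myFn(float currentElement, myMatrix M)
{
    /* __builtin_constant_p yields 1 only when the compiler can prove the
       argument is a compile-time constant, typically after inlining. */
    if (__builtin_constant_p(currentElement) && currentElement == 0.0f)
        return 0.0f;               /* det(M) is not evaluated on this path */
    return currentElement * det(M);
}
```

With optimisation enabled, a call such as myFn(0.0f, M) can collapse to the constant 0, while myFn(x, M) with a run-time x compiles to the plain multiplication with no extra branch. The shortcut also returns 0 where the plain product would be NaN, so it is only safe if det(M) is known to be finite.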
