How does this float square root approximation work?

Question

I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain to me why this code works?

float sqrt(float f)
{
    const int result = 0x1fbb4000 + (*(int*)&f >> 1);
    return *(float*)&result;   
}

I've tested it a bit and its output differs from that of std::sqrt() by about 1 to 3%. I know of Quake III's fast inverse square root and I guess it's something similar here (without the Newton iteration), but I'd really appreciate an explanation of how it works.

(Note: I've tagged it both c and c++ since it's both valid-ish (see comments) C and C++ code.)

Answer 1

(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost divides the exponent by two, which is approximately equivalent to taking the square root.¹

Why almost? In IEEE-754, the actual exponent is e - 127.² To divide this by two, we'd need e/2 - 64, but the above approximation only gives us e/2 - 127. So we need to add 63 to the resulting exponent. This is contributed by bits 30-23 of that magic constant (0x1fbb4000).

I'd imagine the remaining bits of the magic constant have been chosen to minimise the maximum error across the mantissa range, or something like that. However, it's unclear whether they were determined analytically, iteratively, or heuristically.

It's worth pointing out that this approach is somewhat non-portable. It makes (at least) the following assumptions:

The platform uses single-precision IEEE-754 for float.
The endianness of the float representation.
That you will be unaffected by undefined behaviour due to the fact that this approach violates C/C++'s strict-aliasing rules.

Thus it should be avoided unless you're certain that it gives predictable behaviour on your platform (and indeed, that it provides a useful speedup vs. sqrtf!).

1. sqrt(a^b) = (a^b)^0.5 = a^(b/2)

2. See e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding

Answer 2

See Oliver Charlesworth's explanation of why this almost works. I'm addressing an issue raised in the comments.

Since several people have pointed out the non-portability of this, here are some ways you can make it more portable, or at least make the compiler tell you if it won't work.

First, C++ allows you to check std::numeric_limits<float>::is_iec559 at compile time, such as in a static_assert. You can also check that sizeof(int) == sizeof(float), which will not be true if int is 64 bits wide, but what you really want to do is use uint32_t, which, if it exists, will always be exactly 32 bits wide, will have well-defined behavior with shifts and overflow, and will cause a compilation error if your weird architecture has no such integral type. Either way, you should also static_assert() that the types have the same size. Static assertions have no run-time cost, and you should always check your preconditions this way if possible.

Unfortunately, the test of whether converting the bits in a float to a uint32_t and shifting is big-endian, little-endian or neither cannot be computed as a compile-time constant expression. Here, I put the run-time check in the part of the code that depends on it, but you might want to put it in the initialization and do it once. In practice, both gcc and clang can optimize this test away at compile time.

You do not want to use the unsafe pointer cast, and there are some systems I've worked on in the real world where that could crash the program with a bus error. The maximally-portable way to convert object representations is with memcpy(). In my example below, I type-pun with a union, which works on any actually-existing implementation. (Language lawyers object to it, but no successful compiler will ever break that much legacy code silently.) If you must do a pointer conversion (see below), there is alignas(). But however you do it, the result will be implementation-defined, which is why we check the result of converting and shifting a test value.

Anyway, not that you're likely to use it on a modern CPU, here's a gussied-up C++14 version that checks those non-portable assumptions:

#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>

using std::cout;
using std::endl;
using std::size_t;
using std::sqrt;
using std::uint32_t;

template <typename T, typename U>
  inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it reads an inactive union member.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  union tu_pun {
    U u = U();
    T t;
  };

  const tu_pun pun{x};
  return pun.t;
}

constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;

const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
const bool is_little_endian = after_rshift == target;

float est_sqrt(const float x)
/* A fast approximation of sqrt(x) that works less well for subnormal numbers.
 */
{
  static_assert( std::numeric_limits<float>::is_iec559, "" );
  assert(is_little_endian); // Could provide alternative big-endian code.

 /* The algorithm relies on the bit representation of normal IEEE floats, so
  * a subnormal number as input might be considered a domain error as well?
  */
  if ( std::isless(x, 0.0F) || !std::isfinite(x) )
    return std::numeric_limits<float>::signaling_NaN();

  constexpr uint32_t magic_number = 0x1fbb4000UL;
  const uint32_t raw_bits = reinterpret<uint32_t,float>(x);
  const uint32_t rejiggered_bits = (raw_bits >> 1U) + magic_number;
  return reinterpret<float,uint32_t>(rejiggered_bits);
}

int main(void)
{  
  static const std::vector<float> test_values{
    4.0F, 0.01F, 0.0F, 5e20F, 5e-20F, 1.262738e-38F };

  for ( const float& x : test_values ) {
    const double gold_standard = sqrt((double)x);
    const double estimate = est_sqrt(x);
    const double error = estimate - gold_standard;

    cout << "The error for (" << estimate << " - " << gold_standard << ") is "
         << error;

    if ( gold_standard != 0.0 && std::isfinite(gold_standard) ) {
      const double error_pct = error/gold_standard * 100.0;
      cout << " (" << error_pct << "%).";
    } else
      cout << '.';

    cout << endl;
  }

  return EXIT_SUCCESS;
}

Update

Here is an alternative definition of reinterpret<T,U>() that avoids type-punning. You could also implement the type-pun in modern C, where it's allowed by the standard, and call the function as extern "C". I think type-punning is more elegant, type-safe and consistent with the quasi-functional style of this program than memcpy(). I also don't think you gain much, because you could still have undefined behavior from a hypothetical trap representation. Also, clang++ 3.9.1 -O -S is able to statically analyze the type-punning version, optimize the variable is_little_endian to the constant 0x1, and eliminate the run-time test, but it can only optimize this version down to a single-instruction stub.

But more importantly, this code isn't guaranteed to work portably on every compiler. For example, some old computers can't even address exactly 32 bits of memory. But in those cases, it should fail to compile and tell you why. No compiler is just suddenly going to break a huge amount of legacy code for no reason. Although the standard technically gives permission to do that and still say it conforms to C++14, it will only happen on an architecture very different from what we expect. And if our assumptions are so invalid that some compiler is going to turn a type-pun between a float and a 32-bit unsigned integer into a dangerous bug, I really doubt the logic behind this code will hold up if we just use memcpy() instead. We want that code to fail at compile time, and to tell us why.

#include <cassert>
#include <cstdint>
#include <cstring>

using std::memcpy;
using std::uint32_t;

template <typename T, typename U> inline T reinterpret(const U &x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it modifies a variable.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  T temp;

  memcpy( &temp, &x, sizeof(T) );
  return temp;
}

constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;

const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
extern const bool is_little_endian = after_rshift == target;

However, Stroustrup et al., in the C++ Core Guidelines, recommend a reinterpret_cast instead:

#include <cassert>

template <typename T, typename U> inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it uses reinterpret_cast.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  const U temp alignas(T) alignas(U) = x;
  return *reinterpret_cast<const T*>(&temp);
}

The compilers I tested can also optimize this away to a folded constant. Stroustrup's reasoning is [sic]:

Accessing the result of an reinterpret_cast to a different type from the objects declared type is still undefined behavior, but at least we can see that something tricky is going on.

Answer 3

Let y = sqrt(x).

It follows from the properties of logarithms that log(y) = 0.5 * log(x) (1)

Interpreting a normal float as an integer gives approximately INT(x) = Ix = L * (log(x) + B - σ) (2)

where log is the base-2 logarithm, L = 2^N, N is the number of bits of the significand, B is the exponent bias, and σ is a free factor to tune the approximation.

Combining (1) and (2) gives: Iy = 0.5 * (Ix + (L * (B - σ)))

which is written in the code as (*(int*)&x >> 1) + 0x1fbb4000; the right shift supplies 0.5 * Ix and the constant supplies 0.5 * L * (B - σ).

Find the σ so that the constant equals 0x1fbb4000 and determine whether it's optimal.

Answer 4

Adding a wiki test harness to test all floats.

The approximation is within 4% for many floats, but very poor for subnormal numbers. YMMV.

Worst:1.401298e-45 211749.20%
Average:0.63%
Worst:1.262738e-38 3.52%
Average:0.02%

Note that with an argument of +/-0.0, the result is not zero.

printf("% e % e\n", sqrtf(+0.0), sqrt_apx(0.0));  //  0.000000e+00  7.930346e-20
printf("% e % e\n", sqrtf(-0.0), sqrt_apx(-0.0)); // -0.000000e+00 -2.698557e+19

Test code

#include <float.h>
#include <limits.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

float sqrt_apx(float f) {
  const int result = 0x1fbb4000 + (*(int*) &f >> 1);
  return *(float*) &result;
}

double error_value = 0.0;
double error_worst = 0.0;
double error_sum = 0.0;
unsigned long error_count = 0;

void sqrt_test(float f) {
  if (f == 0) return;
  volatile float y0 = sqrtf(f);
  volatile float y1 = sqrt_apx(f);
  double error = (1.0 * y1 - y0) / y0;
  error = fabs(error);
  if (error > error_worst) {
    error_worst = error;
    error_value = f;
  }
  error_sum += error;
  error_count++;
}

void sqrt_tests(float f0, float f1) {
  error_value = error_worst = error_sum = 0.0;
  error_count = 0;
  for (;;) {
    sqrt_test(f0);
    if (f0 == f1) break;
    f0 = nextafterf(f0, f1);
  }
  printf("Worst:%e %.2f%%\n", error_value, error_worst*100.0);
  // Note: unlike the line above, the mean is printed as a fraction (not multiplied by 100).
  printf("Average:%.2f%%\n", error_sum / error_count);
  fflush(stdout);
}

int main() {
  sqrt_tests(FLT_TRUE_MIN, FLT_MIN);
  sqrt_tests(FLT_MIN, FLT_MAX);
  return 0;
}

4 answers

Answer 1: 70, accepted, 2017-03-30 14:14:55
Answer 2: 13, 2017-03-30 23:28:02
Answer 3: 8, 2017-03-30 15:51:14
Answer 4: 6
