Precise conversion of 32-bit unsigned integer into a float in range (-1;1)

Question

According to articles like this, half of the floating-point numbers are in the interval [-1,1]. Could you suggest how to make use of this fact to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?

Naive code:

uint32_t i = /* randomly generated */;
float f = (float)i / (1u<<31) - 1.0f;

The problem here is that the number i is first converted into a float, losing up to 8 lower bits of precision. Only then is the number scaled to the [0;2) interval, and then shifted to the [-1;1) interval.

Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.

Update: the solution with a double is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?

Answer 1

You can do the calculation in double instead, so that no precision of the uint32_t value is lost, and then assign the result to a float.

float f = (double)i / (1u<<31) - 1.0;
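
To make the difference concrete, here is a minimal sketch in C. The function names naive_convert and double_convert are made up for illustration, and (1u << 31) stands in for the MSVC-specific 1ui32<<31. For i = 0x80000001 the naive version rounds i to 2^31 before scaling and ends up at 0, while the double version keeps the low bit and returns 2^-31.

```c
#include <stdint.h>

/* Naive version from the question: i is rounded to 24 significant bits
   as soon as it is converted to float. */
float naive_convert(uint32_t i) {
    return (float)i / (float)(1u << 31) - 1.0f;
}

/* Double version: every uint32_t is exactly representable in a double,
   so only the final conversion to float rounds. */
float double_convert(uint32_t i) {
    return (float)((double)i / (double)(1u << 31) - 1.0);
}
```

The two functions agree on inputs such as 0, and differ wherever the low bits of i survive the subtraction of 1.0, which is mostly for inputs near 0x80000000.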

Answer 2

If you drop the uniform distribution constraint, it is doable with 32-bit integer arithmetic alone:

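//---------------------------------------------------------------------------
// DWORD is the Windows 32 bit unsigned type (<windows.h>); uint32_t from <stdint.h> is the portable equivalent
float i32_to_f32(int   x)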
    {
    int exp;
    union _f32          // semi result
        {
        float f;        // 32bit floating point
        DWORD u;        // 32 bit uint
        } y;
    // edge cases
    if (x== 0x00000000) return  0.0f;
    if (x< -0x1FFFFFFF) return -1.0f;
    if (x> +0x1FFFFFFF) return +1.0f;
    // conversion
    y.u=0;                              // reset bits
    if (x<0){ y.u|=0x80000000; x=-x; }  // sign (31 bits left)
    exp=((x>>23)&63)-64;                // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
    y.u|=(exp+127)<<23;                 // exponent bias and bit position
    y.u|=x&0x007FFFFF;                  // mantissa
    return y.f;
    }
//---------------------------------------------------------------------------
int f32_to_i32(float x)
    {
    int exp,man,i;
    union _f32          // semi result
        {
        float f;        // 32bit floating point
        DWORD u;        // 32 bit uint
        } y;
    // edge cases
    if (x== 0.0f) return  0x00000000;
    if (x<=-1.0f) return -0x1FFFFFFF;
    if (x>=+1.0f) return +0x1FFFFFFF;
    // conversion
    y.f=x;
    exp=(y.u>>23)&255; exp-=127;        // exponent bias and bit position
    if (exp<-64) return 0;              // too small to encode -> 0 (function returns int)
    man=y.u&0x007FFFFF;                 // mantissa
    i =(exp<<23)&0x1F800000;
    i|= man;
    if (y.u>=0x80000000) i=-i;          // sign
    return i;
    }
//---------------------------------------------------------------------------

I chose to use only 29 bits + sign (about 30 bits of the integer) to avoid the complications of denormalized numbers, which I was too lazy to encode (handling them would give you 30 or even 31 bits, but the code would be much slower and more complicated).

But the distribution is not linear nor uniform at all:

[Figure: the float in range <-1,+1> is plotted in red and the integer in range <-1FFFFFFF,+1FFFFFFF> in blue.]

On the other hand there is no rounding at all in both conversions ...

PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).

Answer 3

The thing to realize is that while (float)i does lose up to 8 bits of precision (leaving 24 bits), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing. (It is actually more complicated, because a smaller i loses fewer than 8 bits, but things work out well.)

So we just need to fix the range, so that the originally non-negative value gets mapped to INT_MIN..INT_MAX.

This expression works: (float)(int)(value^0x80000000)/0x80000000

Here's how it works:

1. The (int)(value^0x80000000) part flips the sign bit, so 0x0 gets mapped to INT_MIN, and 0xffffffff gets mapped to INT_MAX.
2. Then there is the conversion to float. This is where some rounding happens, and we lose precision (but it is not a problem).
3. Then just divide by 0x80000000 to get into the range [-1..1]. As this division only adjusts the exponent, it doesn't lose any precision.

So there is only one rounding; the other operations don't lose precision. This chain of operations should have the same effect as calculating the result in infinite precision and then rounding to float (this theoretical rounding has the same effect as the rounding in step 2).

But, to be absolutely sure, I've verified by brute force, checking all 32-bit values, that this expression results in the same value as (float)((double)value/0x80000000-1.0).
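
A check along those lines could be sketched as follows. The names xor_convert, reference_convert and count_mismatches are illustrative and not from the answer, and the cast of an out-of-range value to int32_t is implementation-defined in C, so this assumes the usual two's-complement wraparound.

```c
#include <stdint.h>

/* Answer 3's expression, written with fixed-width types.
   Assumes two's-complement wraparound for the cast to int32_t. */
float xor_convert(uint32_t value) {
    return (float)(int32_t)(value ^ 0x80000000u) / 2147483648.0f;
}

/* Reference: exact arithmetic in double, one rounding at the end. */
float reference_convert(uint32_t value) {
    return (float)((double)value / 2147483648.0 - 1.0);
}

/* Counts inputs where the two conversions disagree, visiting every
   step-th value. step must be nonzero; step = 1 visits all 2^32 inputs. */
uint64_t count_mismatches(uint32_t step) {
    uint64_t mismatches = 0;
    for (uint64_t v = 0; v <= UINT32_MAX; v += step) {
        if (xor_convert((uint32_t)v) != reference_convert((uint32_t)v)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_mismatches(1) repeats the full brute-force run; a larger step gives a quick spot check.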

Answer 4

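I suggest the following (if you want to avoid the division and use a scale factor that is exactly representable in floating point, 1.0 * 2^-31, so that the result spans [-1,1) as the question asks):

float e = i * ldexp(1.0,-31) - 1.0;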
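
As a rough sketch of this idea as a function, assuming the scale factor should be 2^-31 so that the full uint32_t range lands in [-1, 1): the name ldexp_convert is made up here, and note that ldexp(1.0, ...) returns a double, so this variant still relies on double-precision arithmetic.

```c
#include <math.h>
#include <stdint.h>

/* Multiply by 2^-31 instead of dividing by 2^31. ldexp(1.0, -31) is exactly
   2^-31, so the product and the subtraction are exact in double and only
   the final conversion to float rounds. */
float ldexp_convert(uint32_t i) {
    return (float)((double)i * ldexp(1.0, -31) - 1.0);
}
```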

Answer 5

Any ideas how to achieve this without using double-precision floating-point?

Without assuming too much about the internals of float:

Shift u until the most significant bit is set, halving the float conversion value.

"keeping the uniform distribution"

50% of the uint32_t values will be in the [0.5 ... 1.0)
25% of the uint32_t values will be in the [0.25 ... 0.5)
12.5% of the uint32_t values will be in the [0.125 ... 0.25)
6.25% of the uint32_t values will be in the [0.0625 ... 0.125)
...
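
The band a value falls into is determined by its number of leading zero bits, which is also how many times the loop in the code shifts. A small sketch (leading_zeros is a name chosen here for illustration): a nonzero u with k leading zeros lies in [2^(31-k), 2^(32-k)), so u / 2^32 lies in [2^-(k+1), 2^-k), and 2^(31-k) of the 2^32 inputs share that k, which gives the 50%, 25%, 12.5%, ... split above.

```c
#include <stdint.h>

/* Number of leading zero bits of u. The caller must pass u != 0,
   otherwise the loop never terminates. */
int leading_zeros(uint32_t u) {
    int k = 0;
    while ((u & 0x80000000u) == 0) {
        u <<= 1;
        k++;
    }
    return k;
}
```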

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

float ui32to0to1(uint32_t u) {
  if (u) {
    float band = 1.0f/(1llu<<32);
    while ((u & 0x80000000) == 0) {
      u <<= 1;
      band *= 0.5f;
    }
    return (float)u * band;
  }
  return 0.0f;
}

Some test code to show functional equivalence to double .

int test(uint32_t u) {
  volatile float f0 = (float) ((double)u / (1llu<<32));
  volatile float f1 = ui32to0to1(u);
  if (f0 != f1) {
    printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
    return 1;
  }
  return 0;
}

int main(void) {
  for (int i=0; i<100000000; i++) {
    test(rand()*65535u ^ rand());
  }
  return 0;
}

Various optimizations are possible, especially when assuming properties of float. Yet for an initial answer, I'll stick to a general approach.

For improved efficiency, the loop only needs to iterate from 32 down to FLT_MANT_DIG, which is usually 24.

#include <float.h>  // FLT_MANT_DIG

float ui32to0to1(uint32_t u) {
  float band = 1.0f/(1llu<<32);
  for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
    u <<= 1;
    band *= 0.5f;
  }
  return (float)u * band;
}

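This answer maps [0, 2³²-1] to [0.0, 1.0]; the largest inputs round up to exactly 1.0f, just as they do in the double-based conversion.

To map [0, 2³²-1] to the range -1.0 to 1.0 instead, split the input at 0x80000000. This variant can produce -0.0.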

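// The enclosing function was not shown in the answer; the name here is illustrative.
float ui32tom1to1(uint32_t u) {
  if (u >= 0x80000000) {
    return ui32to0to1((u - 0x80000000)*2);
  } else
    return -ui32to0to1((0x7FFFFFFF - u)*2);
}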
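
Put together as one self-contained unit, the signed mapping might look like the sketch below. The wrapper name ui32tom1to1 is hypothetical (the fragment above does not name its function). The lower half of the inputs mirrors the upper half, so the result is symmetric around zero, and u = 0x7FFFFFFF yields -0.0.

```c
#include <float.h>
#include <stdint.h>

/* Answer 5's helper: u / 2^32 with a single rounding, float only. */
static float ui32to0to1(uint32_t u) {
    float band = 1.0f / (1llu << 32);  /* exactly 2^-32 */
    for (int i = 32; i > FLT_MANT_DIG && (u & 0x80000000u) == 0; i--) {
        u <<= 1;        /* move the leading bit up ...          */
        band *= 0.5f;   /* ... and compensate in the scale      */
    }
    return (float)u * band;
}

/* Hypothetical wrapper name. Inputs >= 0x80000000 map to non-negative
   results; inputs below mirror them with a minus sign. */
float ui32tom1to1(uint32_t u) {
    if (u >= 0x80000000u) {
        return ui32to0to1((u - 0x80000000u) * 2);
    }
    return -ui32to0to1((0x7FFFFFFFu - u) * 2);
}
```

With this layout, ui32tom1to1(0x80000000 + k) equals -ui32tom1to1(0x7FFFFFFF - k) for every valid k.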


5 answers

solution1
2 2019-06-20 20:17:15

solution2
1 2019-06-21 10:14:19

solution3
1 2019-07-01 07:27:29

solution4
0

solution5
0 2019-06-30 23:27:45
