
Dividing uint64_t by numeric_limits<uint64_t>::max() to a floating point representation

Given a uint64_t value, is it possible to divide it by std::numeric_limits<uint64_t>::max() so that a floating point representation of the value results ( 0.0 to 1.0 representing 0 to 2^64-1 )?

Numbers bigger than max can be chalked up to undefined behaviour, as long as every number equal to or smaller than max is correctly divided to its floating point "counterpart" (or the nearest number the floating point type is capable of representing instead of the real value)

I'm not sure casting one (or both) sides to long double will result in correct values for all valid inputs, because the standard doesn't guarantee long double to have a mantissa of 64 bits. Is this possible at all?

Multiprecision arithmetic is not required. In a floating point type whose significand (aka mantissa) has fewer than 64 bits, division by n_max = std::numeric_limits<uint64_t>::max() can be computed in an exactly rounded way (i.e. the result of the computation is identical to the closest approximation of the exact arithmetic ratio in the target floating point format) as follows:

n/n_max = n/(2^64 - 1) = (n/2^64) / (1 - 2^-64) = (n/2^64) * (1 + 2^-64 + 2^-128 + ...) = n/2^64 + (terms that don't fit in the significand)

Thus the result is

n/n_max = n/2^64

The following C++ test program implements both the naive and accurate methods of computing the ratio n/n_max:

#include <climits>
#include <cmath>
#include <iostream>
#include <limits>
#include <type_traits>


template<typename F, typename U>
F map_to_unit_range_naive(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    return F(n)/F(std::numeric_limits<U>::max());
}

template<typename F, typename U>
F map_to_unit_range_accurate(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    const int UBITS = sizeof(U) * CHAR_BIT;
    return std::ldexp(F(n), -UBITS);
}

template<class F, class U>
double error_mapping_to_unit_range(U n)
{
    const F r1 = map_to_unit_range_accurate<F>(n);
    const F r2 = map_to_unit_range_naive<F>(n);
    return (1-r2/r1);
}

#define CHECK_MAPPING_TO_UNIT_RANGE(n, result_type)                     \
    std::cout << "map_to_unit_range<" #result_type ">(" #n "): err="    \
              << error_mapping_to_unit_range<result_type>(n)*100 << "%" \
              << std::endl;

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,         float);
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u,  float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, float);
    std::cout << "\n";
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, double);
    return 0;
}

The program demonstrates that the naive method is on par with the carefully crafted code:

map_to_unit_range<float>(123u): err=0%
map_to_unit_range<float>(123ul): err=0%
map_to_unit_range<float>(1234567890u): err=0%
map_to_unit_range<float>(1234567890ul): err=0%

map_to_unit_range<double>(123ul): err=0%
map_to_unit_range<double>(1234567890ul): err=0%

This may seem surprising at first, but it has a simple explanation: if the floating point type cannot represent the integral value 2^N - 1 exactly, then it rounds it to 2^N, effectively resulting in an accurate division on the next step (according to the above formula).
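That rounding can be checked directly. The following is only a minimal sketch, assuming IEEE 754 float and double and the default round-to-nearest mode; the helper name max_rounds_up_to_power_of_two is made up for illustration and is not part of the program above.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: returns true when converting the largest value of U
// (2^N - 1, with N the number of value bits in U) to F yields exactly 2^N,
// i.e. when F cannot hold 2^N - 1 and rounds it up.
template<typename F, typename U>
bool max_rounds_up_to_power_of_two()
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    const int N = std::numeric_limits<U>::digits;
    return F(std::numeric_limits<U>::max()) == std::ldexp(F(1), N);
}
```

Under those assumptions, max_rounds_up_to_power_of_two<double, std::uint64_t>() should be true (53 significand bits cannot hold 2^64 - 1), while max_rounds_up_to_power_of_two<double, std::uint32_t>() should be false, because 2^32 - 1 is exactly representable in a double.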

Note that when the precision of the floating point type exceeds the size of the integer type (so that 2^N - 1 can be represented exactly) the premises for the formula are not met, and the "accurate" method stops being such:

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u, double);
    return 0;
}

Output:

map_to_unit_range<double>(123u): err=-2.32831e-08%
map_to_unit_range<double>(1234567890u): err=-2.32831e-08%

The "error" here is coming from the "accurate" method.
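The size of that discrepancy follows from the formula: dividing by 2^32 instead of 2^32 - 1 changes the result by a relative amount of roughly 2^-32, which is about 2.33e-10, or 2.33e-08 when expressed as a percentage. A small sketch that isolates this for the uint32_t/double case, assuming an IEEE 754 double (the function name is hypothetical):

```cpp
#include <cmath>
#include <cstdint>

// Relative difference between n / 2^32 (the ldexp shortcut) and
// n / (2^32 - 1) (plain division), both evaluated in double.
// n must be non-zero, otherwise the ratio below is 0/0.
double relative_gap_uint32_in_double(std::uint32_t n)
{
    const double shortcut = std::ldexp(double(n), -32); // n / 2^32, exact scaling
    const double divided  = double(n) / 4294967295.0;   // n / (2^32 - 1), rounded once
    return 1.0 - divided / shortcut;                     // close to -2^-32
}
```

For n = 123 or n = 1234567890 this should come out very close to -std::ldexp(1.0, -32), matching the figure printed above once multiplied by 100.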


Credits:

Many thanks to @interjay and @Jonathan Mee for their thorough peer review of the previous versions of this answer.

The easiest, most strictly portable way I believe is boost::multiprecision::cpp_bin_float_quad :

#include <boost/multiprecision/cpp_bin_float.hpp>

#include <limits>
#include <cstdint>
#include <iostream>
#include <iomanip>


int main()
{
    using Float = boost::multiprecision::cpp_bin_float_quad;

    for (std::uint64_t i = 0 ; i < 64 ; ++i)
    {
        auto v = std::uint64_t(1) << i;
        auto x = Float(v);

        x /= std::numeric_limits<std::uint64_t>::max();

        // demonstrate lossless round-trip
        auto y = x * std::numeric_limits<std::uint64_t>::max();

        std::cout << std::setprecision(std::numeric_limits<Float>::digits10)
        << (x * 100) << "% : "
        << std::hex << y.convert_to<std::uint64_t>()
        << std::endl;
    }
}

expected results:

5.42101086242752217033113759205528e-18% : 1
1.08420217248550443406622751841106e-17% : 2
2.16840434497100886813245503682211e-17% : 4
4.33680868994201773626491007364422e-17% : 8
8.67361737988403547252982014728845e-17% : 10
1.73472347597680709450596402945769e-16% : 20
3.46944695195361418901192805891538e-16% : 40
6.93889390390722837802385611783076e-16% : 80
1.38777878078144567560477122356615e-15% : 100
2.7755575615628913512095424471323e-15% : 200
5.55111512312578270241908489426461e-15% : 400
1.11022302462515654048381697885292e-14% : 800
2.22044604925031308096763395770584e-14% : 1000
4.44089209850062616193526791541169e-14% : 2000
8.88178419700125232387053583082337e-14% : 4000
1.77635683940025046477410716616467e-13% : 8000
3.55271367880050092954821433232935e-13% : 10000
7.1054273576010018590964286646587e-13% : 20000
1.42108547152020037181928573293174e-12% : 40000
2.84217094304040074363857146586348e-12% : 80000
5.68434188608080148727714293172696e-12% : 100000
1.13686837721616029745542858634539e-11% : 200000
2.27373675443232059491085717269078e-11% : 400000
4.54747350886464118982171434538157e-11% : 800000
9.09494701772928237964342869076313e-11% : 1000000
1.81898940354585647592868573815263e-10% : 2000000
3.63797880709171295185737147630525e-10% : 4000000
7.27595761418342590371474295261051e-10% : 8000000
1.4551915228366851807429485905221e-09% : 10000000
2.9103830456733703614858971810442e-09% : 20000000
5.8207660913467407229717943620884e-09% : 40000000
1.16415321826934814459435887241768e-08% : 80000000
2.32830643653869628918871774483536e-08% : 100000000
4.65661287307739257837743548967072e-08% : 200000000
9.31322574615478515675487097934145e-08% : 400000000
1.86264514923095703135097419586829e-07% : 800000000
3.72529029846191406270194839173658e-07% : 1000000000
7.45058059692382812540389678347316e-07% : 2000000000
1.49011611938476562508077935669463e-06% : 4000000000
2.98023223876953125016155871338926e-06% : 8000000000
5.96046447753906250032311742677853e-06% : 10000000000
1.19209289550781250006462348535571e-05% : 20000000000
2.38418579101562500012924697071141e-05% : 40000000000
4.76837158203125000025849394142282e-05% : 80000000000
9.53674316406250000051698788284564e-05% : 100000000000
0.000190734863281250000010339757656913% : 200000000000
0.000381469726562500000020679515313826% : 400000000000
0.000762939453125000000041359030627651% : 800000000000
0.0015258789062500000000827180612553% : 1000000000000
0.00305175781250000000016543612251061% : 2000000000000
0.00610351562500000000033087224502121% : 4000000000000
0.0122070312500000000006617444900424% : 8000000000000
0.0244140625000000000013234889800848% : 10000000000000
0.0488281250000000000026469779601697% : 20000000000000
0.0976562500000000000052939559203394% : 40000000000000
0.195312500000000000010587911840679% : 80000000000000
0.390625000000000000021175823681358% : 100000000000000
0.781250000000000000042351647362715% : 200000000000000
1.56250000000000000008470329472543% : 400000000000000
3.12500000000000000016940658945086% : 800000000000000
6.25000000000000000033881317890172% : 1000000000000000
12.5000000000000000006776263578034% : 2000000000000000
25.0000000000000000013552527156069% : 4000000000000000
50.0000000000000000027105054312138% : 8000000000000000

You'll get better performance with boost::multiprecision::float128, but it only works with GCC (specifying -std=gnu++NN) or Intel compilers.

I would infer from your question:

I'm not sure casting one (or both) sides to long double will result in correct values for all valid inputs, because the standard doesn't guarantee long double to have a mantissa of 64 bits. Is this possible at all?

that what you're asking is:

Can any value representable by a uint64_t survive the round trip of being cast into a long double's mantissa and back to a uint64_t?

The answer is implementation dependent. The key lies in how many digits a long double uses for its mantissa. Fortunately C++11 provides you with a way to get that: numeric_limits<long double>::digits. For example:

const auto ui64max = numeric_limits<uint64_t>::max();
const auto foo = ui64max - 1;
const auto bar = static_cast<long double>(foo) / ui64max;

cout << "Max Digits For Roundtrip Guarantee: " << numeric_limits<long double>::digits
     << "\nMax Digits In uint64_t: " << numeric_limits<uint64_t>::digits
     << "\nConverting: " << foo
     << "\nTo long double Mantissa: " << bar
     << "\nRoundtrip Back To uint64_t: " << static_cast<uint64_t>(bar * ui64max) << endl;


You can validate this fact at compile time with something like:

static_assert(numeric_limits<long double>::digits >= numeric_limits<uint64_t>::digits, "long double has insufficient mantissa precision in this implementation");
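The same digits comparison can be wrapped in a reusable check, together with a run-time test for a single value. This is only a sketch: both helper names are hypothetical, and it assumes a binary floating point type (numeric_limits<F>::radix == 2) so that digits counts bits.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when every uint64_t value fits in F's significand.
template<typename F>
constexpr bool holds_every_uint64()
{
    return std::numeric_limits<F>::digits >= std::numeric_limits<std::uint64_t>::digits;
}

// Hypothetical helper: does v survive uint64_t -> F -> uint64_t unchanged?
template<typename F>
bool survives_round_trip(std::uint64_t v)
{
    const F f = static_cast<F>(v);
    // Converting a value >= 2^64 back to uint64_t is undefined behaviour,
    // so reject it before casting (this happens when v was rounded up to 2^64).
    if (f >= std::ldexp(F(1), 64))
        return false;
    return static_cast<std::uint64_t>(f) == v;
}
```

With a 53-bit double, holds_every_uint64<double>() should be false, and survives_round_trip<double> should fail for values such as 2^53 + 1 or the maximum uint64_t while succeeding for powers of two. Whether holds_every_uint64<long double>() is true depends on the implementation, which is the point made above.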

For more information on the math supporting round trip questions you can look here: Float Fractional Precision

