Rational to floating point

Question

Consider a rational number represented by the structure below.

struct rational {
    uint64_t n;
    uint64_t d;
    unsigned char sign : 1;
};

Assuming an IEEE-754 binary64 representation of double, how can the structure be converted to the nearest double with correct rounding? The naive method of converting n and d to double and dividing them clearly compounds rounding error.

Answer 1

One way of achieving the desired result is to perform the division in integer space. Standard C/C++ does not offer a 128-bit integer type (although some tool chains provide one as an extension), so this is not very efficient, but it will produce correct results.
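For tool chains that do provide a 128-bit type, the integer-space division can be sketched more directly. The following is only an illustration under stated assumptions: it relies on unsigned __int128, which is a GCC/Clang extension and not standard C, and the name rational2double_i128 is made up for this sketch. Both operands are normalized so that bit 63 is set, which bounds the quotient to 54 or 55 bits; the surplus bits and the remainder then drive the rounding.

```c
#include <stdint.h>
#include <string.h>

struct rational {
    uint64_t n;
    uint64_t d;
    unsigned char sign : 1;
};

/* Sketch only: assumes unsigned __int128 (GCC/Clang extension) */
double rational2double_i128 (struct rational a)
{
    uint64_t n = a.n, d = a.d, q, r, sig, bits;
    int e = 0, s, round, sticky;
    double res;

    if ((n == 0) && (d == 0)) {
        bits = 0xFFF8000000000000ULL;                            /* NaN */
    } else if (n == 0) {
        bits = (uint64_t)a.sign << 63;                           /* zero */
    } else if (d == 0) {
        bits = ((uint64_t)a.sign << 63) | 0x7FF0000000000000ULL; /* Inf */
    } else {
        /* normalize both operands so bit 63 is set: 1/2 < n/d < 2 */
        while (!(n >> 63)) { n <<= 1; e--; }
        while (!(d >> 63)) { d <<= 1; e++; }
        /* quotient of (n * 2^54) / d satisfies 2^53 <= q < 2^55 */
        q = (uint64_t)(((unsigned __int128)n << 54) / d);
        r = (uint64_t)(((unsigned __int128)n << 54) % d);
        /* drop 1 or 2 bits so that 53 significand bits remain */
        s = (q >> 54) ? 2 : 1;
        sig = q >> s;
        round = (int)((q >> (s - 1)) & 1);
        sticky = (r != 0) || ((s == 2) && (q & 1));
        /* round to nearest or even */
        if (round && (sticky || (sig & 1))) {
            sig++;
        }
        /* leading significand bit has weight 2^(e + s - 2); adding sig
           (bit 52 set) supplies the final +1 of the biased exponent */
        bits = ((uint64_t)a.sign << 63) +
               ((uint64_t)(e + s - 2 + 1023 - 1) << 52) + sig;
    }
    memcpy (&res, &bits, sizeof res);
    return res;
}
```

Because every quotient of two nonzero 64-bit integers lies between 2^-64 and 2^64, the result is always a normal number, so no subnormal or overflow handling is needed here.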

The rational2double function below generates 54 quotient bits and a remainder, one bit at a time. The most significant 53 quotient bits represent the mantissa portion of the double result, while the least significant quotient bit and the remainder are needed for rounding to "nearest or even" according to IEEE-754.

The code below can be compiled as either a C or a C++ program (at least it does with my tool chain). It has been lightly tested. Due to the bit-wise processing, this isn't very fast, and various optimizations are possible, especially if machine-specific data types and intrinsics are employed.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct rational {
    uint64_t n;
    uint64_t d;
    unsigned char sign : 1;
};

double uint64_as_double (uint64_t a)
{
    double res;
#if defined (__cplusplus)
    memcpy (&res, &a, sizeof (res));
#else /* __cplusplus */
    volatile union {
        double f;
        uint64_t i;
    } cvt;
    cvt.i = a;
    res = cvt.f;
#endif /* __cplusplus */
    return res;
}

#define ADDcc(a,b,cy,t0,t1) (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) (t0=(b)+cy, t1=(a), t0+t1)
#define SUBcc(a,b,cy,t0,t1) (t0=(b), t1=(a), cy=t1<t0, t1-t0)

double rational2double (struct rational a)
{
    uint64_t dividend, divisor, lo, quot, rem, t0, t1, cy, res, expo;
    int sticky, round, odd, sign, ovf, i;

    dividend = a.n;
    divisor = a.d;
    sign = a.sign;

    /* handle special cases */
    if ((dividend == 0) && (divisor == 0)) {
        res = 0xFFF8000000000000ULL; /* NaN INDEFINITE */
    } else if (dividend == 0) {
        res = (uint64_t)sign << 63; /* zero */
    } else if (divisor == 0) {
        res = ((uint64_t)sign << 63) | 0x7ff0000000000000ULL; /* Inf */
    }
    /* handle normal cases */
    else {
        lo = dividend; /* dividend bits not yet shifted into rem */
        rem = 0;
        expo = 0;
        ovf = 0;
        /* normalize operands using 128-bit shifts; ovf records a bit
           shifted out of rem, which implies rem >= divisor */
        while (!ovf && (rem < divisor)) {
            ovf = (int)(rem >> 63);
            lo = ADDcc (lo, lo, cy, t0, t1);
            rem = ADDC (rem, rem, cy, t0, t1);
            expo--;
        }
        /* integer bit of quotient is known to be 1; the subtraction is
           correct modulo 2^64 even if a bit was shifted out of rem */
        rem = rem - divisor;
        quot = 1;
        /* generate 53 more quotient bits */
        for (i = 0; i < 53; i++) {
            ovf = (int)(rem >> 63);
            lo = ADDcc (lo, lo, cy, t0, t1);
            rem = ADDC (rem, rem, cy, t0, t1);
            rem = SUBcc (rem, divisor, cy, t0, t1);
            quot = quot + quot;
            if (cy && !ovf) {
                rem = rem + divisor;
            } else {
                quot = quot + 1;
            }
        }
        /* round to nearest or even; unconsumed dividend bits are sticky */
        sticky = (rem != 0) || (lo != 0);
        round = quot & 1;
        quot = quot >> 1;
        odd = quot & 1;
        if (round && (sticky || odd)) {
            quot++;
        }
        /* compose normalized IEEE-754 double-precision number */
        res = ((uint64_t)sign << 63) + ((expo + 64 + 1023 - 1) << 52) + quot;
    }
    return uint64_as_double (res);
}
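
The final rounding step of rational2double can be looked at in isolation. The helper below is a small sketch, not part of the answer's code; the name round_nearest_even and its parameters are chosen here for illustration. It takes a quotient whose least significant bit is the round bit, plus a flag saying whether anything nonzero was left over (the sticky information), and applies the same "nearest or even" rule.

```c
#include <stdint.h>

/* q: quotient with one extra low-order bit (the round bit)
   sticky: nonzero if any discarded bits below the round bit are nonzero */
static uint64_t round_nearest_even (uint64_t q, int sticky)
{
    int round = (int)(q & 1);   /* first discarded bit */
    uint64_t sig = q >> 1;      /* bits that are kept */
    int odd = (int)(sig & 1);   /* is the kept value odd? */
    /* round up when above the halfway point, or exactly halfway and odd */
    if (round && (sticky || odd)) {
        sig++;
    }
    return sig;
}
```

With small values the cases are easy to check by hand: 0xB (binary 1011) with no sticky bits is an exact tie between 5 and 6 and goes to the even value 6, 0x9 (binary 1001) is a tie between 4 and 5 and stays at 4, and 0x9 with sticky bits set lies above the halfway point and rounds up to 5.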

solution1
2 ACCEPTED 2015-10-05 20:32:27
