IEEE-754 Floating Point Exponent Alignment Issue

Question

I'm building a floating-point calculator from the ground up, and I'm having trouble with the step that aligns the exponents of two numbers when they are not equal.

For instance: 72.5 + 12.25 = 84.75

But my program is instead returning 106.5

Here is the code for the function that aligns the exponents:

void align(MyStruct* a, MyStruct* b)
{
   if (a->exponent > b->exponent)
   {
      b->exponent = a->exponent; // Sets the exponent of b = to a 
      b->fraction >>= a->exponent - b->exponent; // Shifts the mantissa (fraction) bits of b to the right
   }
   return;
}

I don't know what I'm doing wrong here. The binary representation for the example equation above is as shown:

0|10000101|00100010000000000000000 A

0|10000010|10001000000000000000000 B +

When I do b->exponent = a->exponent; , I'm expecting it to make b

0|10000101|10001000000000000000000 , which goes smoothly. Then I expect the mantissa portion of b to be shifted right as many times as necessary to make up for the larger exponent (in this case, 3 times). This also happens without issue, leaving b as 0|10000101|00010001000000000000000

At this point I would expect to get the correct result, but the sum is still wrong. Checking with other floating point calculators online, the result of a + b should be 0|10000101|01010011000000000000000 in binary.

However, when adding my two modified mantissas together, that is not the result I get. What am I doing wrong here? The only thing I suspect is that the hidden bit (the 1) is not being shifted during the process. Is this the case?

I should mention that my structs are composed of three integer variables, each of which represents one part of the IEEE-754 floating point format (sign, exponent, fraction/mantissa). So the mantissa for A, for example, would be 00000000000100010000000000000000 (32 bits instead of 23, but when the three fields are combined they form the full representation of the float). Also, I am fairly sure that my other functions work as intended and that align is the issue here.
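For reference, the three fields quoted above can be reproduced from an ordinary float with a small helper. This is only a sketch: the names Fields and split_float are made up for illustration (the question does not show the real struct), and it assumes that float is a 32-bit IEEE-754 binary32 value on the platform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: three integers, one per IEEE-754 field. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 stored bits, hidden bit not included */
} Fields;

/* Splits a float into sign, biased exponent and stored fraction. */
Fields split_float(float f)
{
    uint32_t bits;
    Fields out;

    /* Assumption: float is 32 bits wide (IEEE-754 binary32). */
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);  /* copy the raw bit pattern */

    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

Under those assumptions, split_float(72.5f) gives exponent 0x85 and fraction 0x110000, and split_float(12.25f) gives exponent 0x82 and fraction 0x440000, which are the A and B patterns shown above.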

Any advice?

Answer 1

The calculation would have been wrong regardless of the hidden bit, because I computed the shift amount from the difference between the exponents after I had already set them equal to each other. That difference is always 0, so nothing was shifted. That was a silly oversight on my part.
The remaining issue was resolved by setting the 24th bit (bit 23, the hidden bit) in the mantissa that gets shifted. The bit is not stored in the 23-bit fraction, but as someone pointed out, it is implied to be there and has to move along with the other bits when the shift occurs.
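The effect of each of the two mistakes can be checked numerically with the operands from the question. The following is a minimal sketch, not the original program: make_float and add_with are hypothetical helpers, the operands are hard-coded, carries out of the fraction are ignored, and float is assumed to be IEEE-754 binary32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rebuilds a positive float from a biased exponent and 23 fraction bits. */
float make_float(uint32_t exponent, uint32_t fraction)
{
    uint32_t bits = (exponent << 23) | (fraction & 0x7FFFFFu);
    float f;

    assert(sizeof f == sizeof bits);  /* assumption: 32-bit float */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Adds B's fraction to A's fraction after shifting it right by disp.
   set_hidden chooses whether bit 23 is set on B before the shift.
   A keeps its hidden bit implicit, as in the answer's approach. */
float add_with(uint32_t disp, int set_hidden)
{
    const uint32_t exp_a  = 0x85;      /* A = 72.5 */
    const uint32_t frac_a = 0x110000;
    const uint32_t frac_b = 0x440000;  /* B = 12.25, exponent 0x82 */
    uint32_t shifted = frac_b;

    assert(disp < 32);                 /* larger shifts are undefined in C */
    if (set_hidden)
        shifted |= 1u << 23;           /* make the implied 1 explicit */
    shifted >>= disp;

    return make_float(exp_a, frac_a + shifted);
}
```

add_with(0, 0) reproduces the wrong 106.5 (no shift at all), add_with(3, 0) gives 76.75 (shift without the hidden bit), and add_with(3, 1) gives the expected 84.75.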

The fixed code is:

void align(MyStruct* a, MyStruct* b)
{
    if (a->exponent != b->exponent) // If the exponents are not equal
    {
        if (a->exponent > b->exponent)
        {
            int disp = a->exponent - b->exponent; // number of shifts needed, computed before the exponents are changed
            b->fraction |= 1 << 23; // sets the implicit bit of b, the operand with the smaller exponent
            b->exponent = a->exponent; // sets exponents equal to each other
            b->fraction >>= disp; // b's mantissa is shifted over to accommodate the increase in power
            return;
        }
        int disp = b->exponent - a->exponent;
        a->fraction |= 1 << 23;
        a->exponent = b->exponent;
        a->fraction >>= disp;
        return;
    }
    return;
}
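For comparison, here is a self-contained sketch of alignment followed by addition. It is not the code from my program: add_positive and HIDDEN_BIT are names made up for this example, the struct layout is an assumption based on the description in the question, and it uses a slightly different technique in that the hidden bit is made explicit on both operands rather than only on the shifted one. It only handles two positive, normal numbers, truncates instead of rounding, and does not handle overflow to infinity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the question only says "three integer variables". */
typedef struct {
    uint32_t sign;      /* 0 or 1 */
    uint32_t exponent;  /* biased, 8 bits */
    uint32_t fraction;  /* 23 stored bits, hidden bit not included */
} MyStruct;

#define HIDDEN_BIT (1u << 23)

/* Adds two positive, normal numbers; truncates instead of rounding. */
MyStruct add_positive(MyStruct a, MyStruct b)
{
    assert(a.sign == 0 && b.sign == 0);
    assert(a.exponent > 0 && a.exponent < 255);
    assert(b.exponent > 0 && b.exponent < 255);

    /* Make a the operand with the larger exponent. */
    if (a.exponent < b.exponent) {
        MyStruct tmp = a;
        a = b;
        b = tmp;
    }

    /* Read the difference BEFORE touching either exponent. */
    uint32_t disp = a.exponent - b.exponent;

    /* Make the hidden bit explicit in both 24-bit significands. */
    uint32_t sig_a = a.fraction | HIDDEN_BIT;
    uint32_t sig_b = b.fraction | HIDDEN_BIT;

    /* Shifting a 32-bit value by 32 or more is undefined in C;
       from 24 on every bit of sig_b is shifted out anyway. */
    sig_b = (disp < 24) ? (sig_b >> disp) : 0;

    uint32_t sum = sig_a + sig_b;
    uint32_t exponent = a.exponent;

    /* A carry into bit 24 means the sum is >= 2.0: renormalize. */
    if (sum & (HIDDEN_BIT << 1)) {
        sum >>= 1;
        exponent += 1;
    }

    /* Drop the hidden bit again before storing the fraction. */
    MyStruct result = { 0, exponent, sum & (HIDDEN_BIT - 1) };
    return result;
}
```

With a = {0, 0x85, 0x110000} (72.5) and b = {0, 0x82, 0x440000} (12.25), this returns exponent 0x85 and fraction 0x298000, i.e. 0|10000101|01010011000000000000000, which is 84.75.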

Thanks to those that helped!

solution1
0 2020-07-26 12:00:05
