
Floating Point addition using integer operations

I am writing code to emulate floating point addition in C++ using only integer addition and shifts, for some homework. I have googled the topic, and I am able to add floating point numbers by aligning the exponents and then adding the significands. The problem is that I could not find the appropriate algorithm for rounding the result. Right now I am using truncation. For a single addition this gives errors on the order of 0.000x. But when I use this adder for more complex calculations like FFTs, the errors become enormous. So what I am looking for is the exact algorithm my machine uses to round floating point results. It would be great if someone could post a link on this.

Thanks in advance.

Most commonly, if the bits to be rounded away represent a value less than half that of the smallest bit to be retained, they are rounded downward, the same as truncation. If they represent more than half, they are rounded upward, thus adding one in the position of the smallest retained bit. If they are exactly half, they are rounded downward if the smallest retained bit is zero and upward if the bit is one. This is called “round-to-nearest, ties to even.”
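As a rough illustration of the rule above, here is a minimal sketch that assumes the exact result is held in a 64-bit unsigned integer and the low `shift` bits are the ones being rounded away. The function name `round_shift_rne` is made up for this example, not taken from any library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: drop the low `shift` bits of `value`,
// rounding to nearest with ties going to even.
std::uint64_t round_shift_rne(std::uint64_t value, unsigned shift) {
    assert(shift >= 1 && shift <= 63);  // keep both shifts below well defined

    const std::uint64_t half = std::uint64_t{1} << (shift - 1);  // half of the smallest retained bit
    const std::uint64_t discarded = value & ((std::uint64_t{1} << shift) - 1);
    std::uint64_t kept = value >> shift;  // this alone would be truncation

    // More than half: round up. Exactly half: round up only if the
    // smallest retained bit is 1, so the result ends up even.
    if (discarded > half || (discarded == half && (kept & 1u) != 0)) {
        ++kept;
    }
    return kept;
}
```

Note that rounding up can carry out of the top of the significand (for example, all ones becoming a one followed by zeros), so an adder built on this would still have to renormalize and bump the exponent afterwards.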

This presumes you have all the bits you are rounding away, that none have been lost yet in the course of doing arithmetic. If you cannot keep all the bits, there are techniques for keeping track of enough information about them to do the correct rounding, such as maintaining three bits called guard, round, and sticky bits.
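The guard, round, and sticky idea can be sketched roughly as follows. This assumes the significand is stored in a 32-bit unsigned integer with three extra low-order bits reserved for guard, round, and sticky; the names `shift_right_sticky` and `round_grs` are hypothetical, and the layout is just one possible choice.

```cpp
#include <cstdint>

// Hypothetical helper: shift right by `shift`, OR-ing every bit that
// falls off the end into bit 0 (the sticky bit), so the fact that
// "something nonzero was lost" survives the shift.
std::uint32_t shift_right_sticky(std::uint32_t sig, unsigned shift) {
    if (shift == 0) return sig;
    if (shift >= 32) return sig != 0 ? 1u : 0u;  // everything was shifted out
    const std::uint32_t lost = sig & ((std::uint32_t{1} << shift) - 1);
    return (sig >> shift) | (lost != 0 ? 1u : 0u);
}

// Hypothetical helper: remove the three extra bits (guard, round, sticky)
// using round-to-nearest, ties to even.
std::uint32_t round_grs(std::uint32_t sig_with_grs) {
    const std::uint32_t grs = sig_with_grs & 0x7u;  // guard=4, round=2, sticky=1
    std::uint32_t kept = sig_with_grs >> 3;
    // grs > 4 means more than half; grs == 4 means exactly half.
    if (grs > 4 || (grs == 4 && (kept & 1u) != 0)) {
        ++kept;
    }
    return kept;
}
```

In an adder, the smaller operand's significand would be passed through something like `shift_right_sticky` while aligning exponents, and the sum would go through something like `round_grs` after normalization. The sticky bit is what distinguishes an exact tie from a value slightly above half, which plain truncation of the shifted operand cannot do.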

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address.

 