Floating Point addition using integer operations

Original post: 2013-05-09 13:42:30 · Tags: c++, c, floating-point, floating-point-precision

I am writing code to emulate floating-point addition in C++ using only integer addition and shifts, for a homework assignment. I have searched the topic, and I am able to add floating-point numbers by adjusting the exponents and then adding. The problem is that I could not find the appropriate algorithm for rounding the result; right now I am using truncation. On its own, that gives errors of roughly 0.000x magnitude. But when I use this adder for more complex calculations such as FFTs, the errors become enormous. So what I am looking for is the exact algorithm my machine uses to round floating-point results. It would be great if someone could post a link on this.

Thanks in advance.

1 answer

Most commonly, if the bits to be rounded away represent a value less than half that of the smallest bit to be retained, they are rounded downward, the same as truncation. If they represent more than half, they are rounded upward, thus adding one in the position of the smallest retained bit. If they are exactly half, they are rounded downward if the smallest retained bit is zero and upward if the bit is one. This is called “round-to-nearest, ties to even.”
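The rule above can be sketched with integer operations only. This is a minimal illustration, not a full adder: the function name `round_nearest_even` and the choice of a 64-bit integer to hold the unrounded value are assumptions made for the example, and it assumes no bits have been lost before rounding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: drop the low `shift` bits of `value`,
// rounding to nearest with ties going to the even result.
// Assumes 1 <= shift <= 63.
std::uint64_t round_nearest_even(std::uint64_t value, unsigned shift) {
    assert(shift >= 1 && shift <= 63);
    // Value of exactly half of the smallest retained bit.
    const std::uint64_t half = std::uint64_t{1} << (shift - 1);
    // Mask selecting the bits that will be rounded away.
    const std::uint64_t mask = (std::uint64_t{1} << shift) - 1;
    const std::uint64_t discarded = value & mask;
    std::uint64_t kept = value >> shift;  // plain truncation
    // More than half: round up. Exactly half: round up only if
    // the smallest retained bit is one (so the result is even).
    if (discarded > half || (discarded == half && (kept & 1u) != 0)) {
        ++kept;
    }
    return kept;
}
```

Rounding up can carry out of the top of the significand (for example, all ones becoming a one followed by zeros), so a complete adder would still need to renormalize and adjust the exponent after this step.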

This presumes you have all the bits you are rounding away, that none have been lost yet in the course of doing arithmetic. If you cannot keep all the bits, there are techniques for keeping track of enough information about them to do the correct rounding, such as maintaining three bits called guard, round, and sticky bits.
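As a rough sketch of the guard, round, and sticky idea: the significand is widened by three low bits before alignment, and any one bits shifted out below them are ORed into the lowest (sticky) bit instead of being dropped. The function names `shift_right_sticky` and `round_grs` and the 64-bit working width are assumptions for this example, not part of any standard API.

```cpp
#include <cstdint>

// Hypothetical helper: shift right by `shift`, ORing any one bits
// that fall off the end into bit 0 (the sticky bit).
std::uint64_t shift_right_sticky(std::uint64_t value, unsigned shift) {
    if (shift == 0) return value;
    if (shift >= 64) return value != 0 ? 1u : 0u;  // everything was shifted out
    const std::uint64_t lost = value & ((std::uint64_t{1} << shift) - 1);
    return (value >> shift) | (lost != 0 ? 1u : 0u);
}

// Hypothetical helper: `extended` carries three extra low bits
// (guard, round, sticky). Remove them with round-to-nearest, ties to even.
std::uint64_t round_grs(std::uint64_t extended) {
    const unsigned grs = static_cast<unsigned>(extended & 0x7u);
    std::uint64_t kept = extended >> 3;
    // grs == 0b100 is an exact tie; anything above it is more than half.
    if (grs > 4 || (grs == 4 && (kept & 1u) != 0)) {
        ++kept;
    }
    return kept;
}
```

For instance, aligning the significand `0b1001` by four places as `round_grs(shift_right_sticky(0b1001u << 3, 4))` yields 1: the true value 9/16 is above one half, and the sticky bit is what distinguishes it from an exact tie, which would have rounded to 0.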


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.

Answer 1, posted 2013-05-09 15:27:54