Properly subtracting float values

Question

I am trying to create an array of values. These values should be "2.4,1.6,.8,0". I am subtracting .8 at every step.

This is how I am doing it (code snippet):

float mean = [[_scalesDictionary objectForKey:@"M1"] floatValue];  //3.2f
float sD = [[_scalesDictionary objectForKey:@"SD1"] floatValue];   //0.8f

nextRegion = mean;
hitWall = NO;
NSMutableArray *minusRegion = [NSMutableArray array];


while (!hitWall) {

    nextRegion -= sD;

if(nextRegion<0.0f){
    nextRegion = 0.0f;
    hitWall = YES;
}

[minusRegion addObject:[NSNumber numberWithFloat:nextRegion]];

}

I am getting this output:

minusRegion = (
    "2.4",
    "1.6",
    "0.8000001",
    "1.192093e-07",
    0
)

I do not want the incredibly small number between .8 and 0. Is there a standard way to truncate these values?

Answer 1

Neither 3.2 nor .8 is exactly representable as a 32-bit float. The representable number closest to 3.2 is 3.2000000476837158203125 (in hexadecimal floating-point, 0x1.99999ap+1). The representable number closest to .8 is 0.800000011920928955078125 (0x1.99999ap-1).

When 0.800000011920928955078125 is subtracted from 3.2000000476837158203125, the exact mathematical result is 2.400000035762786865234375 (0x1.3333338p+1). This result is also not exactly representable as a 32-bit float. (You can see this easily in the hexadecimal floating-point. A 32-bit float has a 24-bit significand. “1.3333338” has one bit in the “1”, 24 bits in the middle six digits, and another bit in the ”8”.) So the result is rounded to the nearest 32-bit float, which is 2.400000095367431640625 (0x1.333334p+1).

Subtracting 0.800000011920928955078125 from that yields 1.6000001430511474609375 (0x1.99999cp+0), which is exactly representable. (The “1” is one bit, the five nines are 20 bits, and the “c” has two significant bits. The low bits two bits in the “c” are trailing zeroes and may be neglected. So there are 23 significant bits.)

Subtracting 0.800000011920928955078125 from that yields 0.800000131130218505859375 (0x1.99999ep-1), which is also exactly representable.

Finally, subtracting 0.800000011920928955078125 from that yields 1.1920928955078125e-07 (0x1p-23).

The lesson to be learned here is the floating-point does not represent all numbers, and it rounds results to give you the closest numbers it can represent. When writing software to use floating-point arithmetic, you must understand and allow for these rounding operations. One way to allow for this is to use numbers that you know can be represented. Others have suggested using integer arithmetic. Another option is to use mostly values that you know can be represented exactly in floating-point, which includes integers up to 2 ²⁴ . So you could start with 32 and subtract 8, yielding 24, then 16, then 8, then 0. Those would be the intermediate values you use for loop control and continuing calculations with no error. When you are ready to deliver results, then you could divide by 10, producing numbers near 3.2, 2.4, 1.6, .8, and 0 (exactly). This way, your arithmetic would introduce only one rounding error into each result, instead of accumulating rounding errors from iteration to iteration.

Answer 2

Another way to do this is to multiply the numbers you get by subtraction by 10, then convert to an integer, then divide that integer by by 10.0.

You can do this easily with the floor function (floorf) like this:

float newValue = floorf(oldVlaue*10)/10;

Answer 3

You're looking at good old floating-point rounding error. Fortunately, in your case it should be simple to deal with. Just clamp:

if( val < increment ){
    val = 0.0;
}

Although, as Eric Postpischil explained below :

Clamping in this way is a bad idea, because sometimes rounding will cause the iteration variable to be slightly less than the increment instead of slightly more, and this clamping will effectively skip an iteration. For example, if the initial value were 3.6f (instead of 3.2f), and the step were .9f (instead of .8f), then the values in each iteration would be slightly below 3.6, 2.7, 1.8, and .9. At that point, clamping converts the value slightly below .9 to zero, and an iteration is skipped.

Therefore it might be necessary to subtract a small amount when doing the comparison.

A better option which you should consider is doing your calculations with integers rather than floats, then converting later.

int increment = 8;
int val = 32;

while( val > 0 ){
    val -= increment;

    float new_float_val = val / 10.0;
};

Properly subtracting float values

Question

3 answers

solution1
3 2012-07-23 15:56:21

solution2
2 2012-07-23 04:14:21

solution3
2 ACCPTED 2012-07-23 04:31:15

Properly subtracting float values

Question

3 answers

solution1 3 2012-07-23 15:56:21

solution2 2 2012-07-23 04:14:21

solution3 2 ACCPTED 2012-07-23 04:31:15

solution1
3 2012-07-23 15:56:21

solution2
2 2012-07-23 04:14:21

solution3
2 ACCPTED 2012-07-23 04:31:15