
Precise sum of floating point numbers

I am aware of a similar question, but I want to ask for people's opinions on my algorithm to sum floating-point numbers as accurately as possible at practical cost.

Here is my first solution:

put all numbers into a min-absolute-heap (ordered by absolute value). // EDIT, as suggested in the comments below
pop the 2 smallest ones.
add them.
put the result back into the heap.
continue until there is only 1 number in the heap.

This one takes O(n log n) instead of the normal O(n). Is it really worth it?
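
For concreteness, here is a minimal sketch of that heap approach, assuming C++ and std::priority_queue (the name heap_sum and the empty-input guard are mine, for illustration only):

#include <cmath>
#include <queue>
#include <vector>

// Always add the two numbers of smallest magnitude and push the
// partial sum back into the heap, as described above.
double heap_sum(const std::vector<double>& xs)
{
    auto by_abs = [](double a, double b) { return std::fabs(a) > std::fabs(b); };
    std::priority_queue<double, std::vector<double>, decltype(by_abs)>
        heap(by_abs, std::vector<double>(xs.begin(), xs.end()));
    while (heap.size() > 1) {
        double a = heap.top(); heap.pop();
        double b = heap.top(); heap.pop();
        heap.push(a + b);
    }
    return heap.empty() ? 0.0 : heap.top();
}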

The second solution comes from a characteristic of the data I'm working on: it is a huge list of positive numbers of similar orders of magnitude.

float a[size];                       // the numbers, starting at index 0
size_t i, step;
for (step = 1; step < size; step <<= 1) {
    for (i = step - 1; i + step < size; i += 2*step)
        a[i + step] += a[i];         // add each pair; the partial sum lands on the right element
    if (i < size - 1)                // a leftover partial sum with no partner...
        a[size - 1] += a[i];         // ...is folded into the last element
}
// the final sum ends up in a[size - 1]

The basic idea is to do the sum in a 'binary tree' fashion.

Note: step <<= 1 means multiply step by 2. This one takes O(n). I feel like there might be a better approach. Can you recommend/criticize?

Kahan's summation algorithm is significantly more precise than straightforward summation, and it runs in O(n) (somewhere between 1 and 4 times slower than straightforward summation, depending on how fast floating-point arithmetic is compared to data access; definitely less than 4 times slower on desktop hardware, and without any shuffling around of data).
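
For reference, a sketch of Kahan's algorithm in its usual textbook form (the variable names sum, c, y, t match the ones referred to below):

#include <cstddef>

// Compensated (Kahan) summation: c carries the low-order bits that a
// plain float accumulator would lose. Compile without unsafe FP
// optimizations (e.g. no -ffast-math), or the compensation cancels out.
float kahan_sum(const float *a, std::size_t n)
{
    float sum = 0.0f, c = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        float y = a[i] - c;   // next term, corrected by the stored error
        float t = sum + y;    // the low-order bits of y are lost here...
        c = (t - sum) - y;    // ...and recovered algebraically
        sum = t;
    }
    return sum;
}

Using long double for sum, c, y, and t here gives the high-precision combination mentioned below.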


Alternatively, if you are using the usual x86 hardware, and if your compiler allows access to the 80-bit long double type, simply use the straightforward summation algorithm with an accumulator of type long double. Only convert the result to double at the very end.
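
A sketch of that variant (extended_sum is an illustrative name; note that on most x86 compilers long double is the 80-bit extended format, but this is not guaranteed everywhere, e.g. MSVC maps long double to double):

#include <cstddef>

// Straightforward summation into an extended-precision accumulator;
// the result is rounded back to double only once, at the very end.
double extended_sum(const double *a, std::size_t n)
{
    long double acc = 0.0L;
    for (std::size_t i = 0; i < n; i++)
        acc += a[i];
    return (double)acc;
}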


If you really need a lot of precision, you can combine the above two solutions by using long double for the variables c, y, t, and sum in Kahan's summation algorithm.

If you are concerned about reducing the numerical error in your summation, you may be interested in Kahan's algorithm.

My guess is that your binary decomposition will work almost as well as Kahan summation.

Here is an example to illustrate it:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

void sumpair( float *a, float *b )
{
    // Fast2Sum-style trick: afterwards *a holds the rounded sum and *b the
    // residue (rounding error), so *a + *b is preserved to high accuracy.
    volatile float sum = *a + *b;
    volatile float small = sum - std::max(*a,*b);
    volatile float residue = std::min(*a,*b) - small;
    *a = sum;
    *b = residue;
}

void sumpairs( float *a, size_t size, size_t stride )
{
    if (size <= stride*2) {
        if (stride < size)
            sumpair(a, a + stride);
    } else {
        size_t half = 1;
        while (half*2 < size) half *= 2;   // largest power of two below size
        sumpairs( a , half , stride );
        sumpairs( a+half , size-half , stride );
    }
}

void sumpairwise( float *a,size_t size )
{
    for(size_t stride=1;stride<size;stride*=2)
        sumpairs(a,size,stride);
}

int main()
{
    float data[10000000];   // big array on the stack, hence the -Wl,-stack_size link flag below
    size_t size= sizeof data/sizeof data[0];
    for(size_t i=0;i<size;i++) data[i]=((1<<30)*-1.0+random())/(1.0+random());

    float naive=0;
    for(size_t i=0;i<size;i++) naive+=data[i];
    printf("naive      sum=%.8g\n",naive);

    double dprec=0;
    for(size_t i=0;i<size;i++) dprec+=data[i];
    printf("dble prec  sum=%.8g\n",(float)dprec);

    sumpairwise( data , size );       // 1st pass: pairwise sum in data[0], residues in the rest
    printf("1st approx sum=%.8g\n",data[0]);
    sumpairwise( data+1 , size-1);    // sum the residues
    sumpairwise( data , 2 );          // fold that correction into data[0]
    printf("2nd approx sum=%.8g\n",data[0]);
    sumpairwise( data+2 , size-2);    // repeat once more on the remaining residues
    sumpairwise( data+1 , 2 );
    sumpairwise( data , 2 );
    printf("3rd approx sum=%.8g\n",data[0]);
    return 0;
}

I declared my operands volatile and compiled with -ffloat-store to avoid extra precision on the x86 architecture:

g++  -ffloat-store  -Wl,-stack_size,0x20000000 test_sum.c

and get the following (0.03125 is 1 ULP):

naive      sum=-373226.25
dble prec  sum=-373223.03
1st approx sum=-373223
2nd approx sum=-373223.06
3rd approx sum=-373223.06

This deserves a little explanation.

  • First I display the naive summation.
  • Then the double-precision summation (Kahan is roughly equivalent to that).
  • The 1st approximation is the same as your binary decomposition, except that I store the sum in data[0] and take care to store the residues. This way, the exact sum of the data before and after summation is unchanged.
  • This lets me approximate the error by summing the residues in a 2nd iteration, in order to correct the 1st iteration (equivalent to applying Kahan to the binary summation).
  • By iterating further I can refine the result further, and we see convergence.

The partial sums come out of the heap in nondecreasing order, so instead of a heap you can use two FIFO queues: one holding the sorted inputs and one holding the partial sums. This runs in O(n) if the numbers are pre-sorted.

This produces the same result as your algorithm and runs in O(n) if the input is pre-sorted (here rendered as C++; the sort call can be dropped when the input is already sorted):

#include <algorithm>
#include <queue>
#include <vector>

float queue_sum(std::vector<float> input)
{
    std::sort(input.begin(), input.end());      // ascending order
    std::queue<float> leaves, nodes;            // nodes holds the partial sums
    for (float x : input) leaves.push(x);

    auto popAny = [&]() -> float {
        std::queue<float> &q =
            leaves.empty() ? nodes
          : nodes.empty()  ? leaves
          : leaves.front() > nodes.front() ? nodes : leaves;
        float v = q.front();
        q.pop();
        return v;
    };

    while (leaves.size() + nodes.size() > 1)    // always add the two smallest values
        nodes.push(popAny() + popAny());
    return popAny();
}
