
Precise sum of floating point numbers

I am aware of a similar question, but I want to ask for people's opinions on my algorithm to sum floating-point numbers as accurately as possible at practical cost.

Here is my first solution:

put all numbers into a min-absolute-heap (ordered by absolute value). // EDIT, as suggested in the comments below
pop the 2 smallest ones.
add them.
put the result back into the heap.
continue until there is only 1 number in the heap.

This one takes O(n log n) instead of the normal O(n). Is it really worth it?
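
For concreteness, here is a minimal sketch of that heap approach, assuming C++ and std::priority_queue (the name heap_sum and the empty-input guard are mine, for illustration only):

#include <cmath>
#include <queue>
#include <vector>

// Always add the two numbers of smallest magnitude and push the
// partial sum back into the heap, as described above.
double heap_sum(const std::vector<double>& xs)
{
    auto by_abs = [](double a, double b) { return std::fabs(a) > std::fabs(b); };
    std::priority_queue<double, std::vector<double>, decltype(by_abs)>
        heap(by_abs, std::vector<double>(xs.begin(), xs.end()));
    while (heap.size() > 1) {
        double a = heap.top(); heap.pop();
        double b = heap.top(); heap.pop();
        heap.push(a + b);
    }
    return heap.empty() ? 0.0 : heap.top();
}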

The second solution comes from a characteristic of the data I'm working on: it is a huge list of positive numbers of similar orders of magnitude.

float a[size];                       // the numbers, starting at index 0
size_t i, step;
for (step = 1; step < size; step <<= 1) {
    for (i = step - 1; i + step < size; i += 2*step)
        a[i + step] += a[i];         // add each pair; the partial sum lands on the right element
    if (i < size - 1)                // a leftover partial sum with no partner...
        a[size - 1] += a[i];         // ...is folded into the last element
}
// the final sum ends up in a[size - 1]

The basic idea is to do the sum in a 'binary tree' fashion.

Note: step <<= 1 means multiply step by 2. This one takes O(n). I feel like there might be a better approach. Can you recommend/criticize?

Kahan's summation algorithm is significantly more precise than straightforward summation, and it runs in O(n) (somewhere between 1 and 4 times slower than straightforward summation, depending on how fast floating-point arithmetic is compared to data access; definitely less than 4 times slower on desktop hardware, and without any shuffling around of data).
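
For reference, a sketch of Kahan's algorithm in its usual textbook form (the variable names sum, c, y, t match the ones referred to below):

#include <cstddef>

// Compensated (Kahan) summation: c carries the low-order bits that a
// plain float accumulator would lose. Compile without unsafe FP
// optimizations (e.g. no -ffast-math), or the compensation cancels out.
float kahan_sum(const float *a, std::size_t n)
{
    float sum = 0.0f, c = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        float y = a[i] - c;   // next term, corrected by the stored error
        float t = sum + y;    // the low-order bits of y are lost here...
        c = (t - sum) - y;    // ...and recovered algebraically
        sum = t;
    }
    return sum;
}

Using long double for sum, c, y, and t here gives the high-precision combination mentioned below.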


Alternatively, if you are using the usual x86 hardware, and if your compiler allows access to the 80-bit long double type, simply use the straightforward summation algorithm with an accumulator of type long double. Only convert the result to double at the very end.
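
A sketch of that variant (extended_sum is an illustrative name; note that on most x86 compilers long double is the 80-bit extended format, but this is not guaranteed everywhere, e.g. MSVC maps long double to double):

#include <cstddef>

// Straightforward summation into an extended-precision accumulator;
// the result is rounded back to double only once, at the very end.
double extended_sum(const double *a, std::size_t n)
{
    long double acc = 0.0L;
    for (std::size_t i = 0; i < n; i++)
        acc += a[i];
    return (double)acc;
}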


If you really need a lot of precision, you can combine the above two solutions by using long double for the variables c, y, t, and sum in Kahan's summation algorithm.

If you are concerned about reducing the numerical error in your summation, you may be interested in Kahan's algorithm.

My guess is that your binary decomposition will work almost as well as Kahan summation.

Here is an example to illustrate it:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

void sumpair( float *a, float *b )
{
    // Fast2Sum-style trick: afterwards *a holds the rounded sum and *b the
    // residue (rounding error), so *a + *b is preserved to high accuracy.
    volatile float sum = *a + *b;
    volatile float small = sum - std::max(*a,*b);
    volatile float residue = std::min(*a,*b) - small;
    *a = sum;
    *b = residue;
}

void sumpairs( float *a, size_t size, size_t stride )
{
    if (size <= stride*2) {
        if (stride < size)
            sumpair(a, a + stride);
    } else {
        size_t half = 1;
        while (half*2 < size) half *= 2;   // largest power of two below size
        sumpairs( a , half , stride );
        sumpairs( a+half , size-half , stride );
    }
}

void sumpairwise( float *a,size_t size )
{
    for(size_t stride=1;stride<size;stride*=2)
        sumpairs(a,size,stride);
}

int main()
{
    float data[10000000];   // big array on the stack, hence the -Wl,-stack_size link flag below
    size_t size= sizeof data/sizeof data[0];
    for(size_t i=0;i<size;i++) data[i]=((1<<30)*-1.0+random())/(1.0+random());

    float naive=0;
    for(size_t i=0;i<size;i++) naive+=data[i];
    printf("naive      sum=%.8g\n",naive);

    double dprec=0;
    for(size_t i=0;i<size;i++) dprec+=data[i];
    printf("dble prec  sum=%.8g\n",(float)dprec);

    sumpairwise( data , size );       // 1st pass: pairwise sum in data[0], residues in the rest
    printf("1st approx sum=%.8g\n",data[0]);
    sumpairwise( data+1 , size-1);    // sum the residues
    sumpairwise( data , 2 );          // fold that correction into data[0]
    printf("2nd approx sum=%.8g\n",data[0]);
    sumpairwise( data+2 , size-2);    // repeat once more on the remaining residues
    sumpairwise( data+1 , 2 );
    sumpairwise( data , 2 );
    printf("3rd approx sum=%.8g\n",data[0]);
    return 0;
}

I declared my operands volatile and compiled with -ffloat-store to avoid extra precision on the x86 architecture:

g++  -ffloat-store  -Wl,-stack_size,0x20000000 test_sum.c

and get the following (0.03125 is 1 ULP):

naive      sum=-373226.25
dble prec  sum=-373223.03
1st approx sum=-373223
2nd approx sum=-373223.06
3rd approx sum=-373223.06

This deserves a little explanation.

  • First I display the naive summation.
  • Then the double-precision summation (Kahan is roughly equivalent to that).
  • The 1st approximation is the same as your binary decomposition, except that I store the sum in data[0] and take care to store the residues. This way, the exact sum of the data before and after summation is unchanged.
  • This lets me approximate the error by summing the residues in a 2nd iteration, in order to correct the 1st iteration (equivalent to applying Kahan to the binary summation).
  • By iterating further I can refine the result further, and we see convergence.

The partial sums come out of the heap in nondecreasing order, so instead of a heap you can use two FIFO queues: one holding the sorted inputs and one holding the partial sums. This runs in O(n) if the numbers are pre-sorted.

This produces the same result as your algorithm and runs in O(n) if the input is pre-sorted (here rendered as C++; the sort call can be dropped when the input is already sorted):

#include <algorithm>
#include <queue>
#include <vector>

float queue_sum(std::vector<float> input)
{
    std::sort(input.begin(), input.end());      // ascending order
    std::queue<float> leaves, nodes;            // nodes holds the partial sums
    for (float x : input) leaves.push(x);

    auto popAny = [&]() -> float {
        std::queue<float> &q =
            leaves.empty() ? nodes
          : nodes.empty()  ? leaves
          : leaves.front() > nodes.front() ? nodes : leaves;
        float v = q.front();
        q.pop();
        return v;
    };

    while (leaves.size() + nodes.size() > 1)    // always add the two smallest values
        nodes.push(popAny() + popAny());
    return popAny();
}
