
Comparing two sums of floating point values in C or C++

Assume you're given two sets of floating point variables implemented according to IEEE 754, meant to be treated as exact values calculated according to the formulae present in the standard. All legal values are possible. The number of variables in a set may be any natural number.

What would be a good way to compare the exact, in the mathematical sense, sums of the values represented by said variables? Due to the domain's nature, the problem can easily be reduced to comparing a single sum to zero. You can disregard the possibility of NaNs or infinities being present, as it is irrelevant to the core problem. (Those values can be checked for easily and independently, and acted upon in a manner suiting the particular application of this problem.)

A naive approach would be to simply sum each set and compare, or to sum the values of one set and subtract the values of the other.

    bool compare(const std::vector<float>& lhs, const std::vector<float>& rhs)
    {
        float lSum = 0.0f;
        for (auto value : lhs)
        {
            lSum += value;
        }
        float rSum = 0.0f;
        for (auto value : rhs)
        {
            rSum += value;
        }

        return lSum < rSum;
    }

Quite obviously there are problems with the naive approach, as mentioned in various other questions regarding floating point arithmetic. Most of the problems are related to two difficulties:

  • result of addition of floating point values differs depending on order
  • certain orders of addition of certain sets of values may result in intermediate overflow (intermediate result of calculations goes beyond range supported by available data type)

    float small = strtof("0x1.0p-126", NULL);
    float big = strtof("0x1.8p126", NULL);
    std::cout << std::hexfloat << small + big - big << std::endl;
    std::cout << std::hexfloat << (big - 2 * small) + (big - small) + big - (big + small) - (big + 2 * small) << std::endl;

    This code will result in 0 and inf; this illustrates how ordering affects the result, and hopefully also that the problem of ordering is non-trivial.

    float prev;
    float curr = 0.0f;
    do {
        prev = curr;
        curr += strtof("0x1.0p-126", NULL);
    } while (prev != curr);
    std::cout << std::hexfloat << curr << std::endl;

This code, given sufficient time to actually finish computing, would result in 0x1.000000p-102, not, as might be naively expected, 0x1.fffffep127. (Changing the initialization of curr to strtof("0x1.fff000p-103", NULL) is advised to actually observe this.) This illustrates how the proportion between the intermediate result of the addition and a particular addend affects the result.

A lot has been said about obtaining the best precision, e.g. in this question.

The problem at hand differs in that we do not want to maximize precision; rather, we have a well-defined function that needs to be implemented exactly.

While to some the idea that this may be a useful exercise seems controversial at best, consider the following scenario: a comparison between such value sets could be the cornerstone of other operations performed on entire datasets independently in various environments. Synchronized, flawless operation of some systems may depend on this comparison being well defined and deterministically implemented, regardless of the order of the addends and of whether the particular architecture implements IEEE 754 or not.

This, or just curiosity.

In the discussion, the Kahan summation algorithm has been mentioned as relevant. However, that algorithm is merely a reasonable attempt at minimizing error. It neither guarantees the correct sign of the result, nor is it independent of the order of operations (which would at least guarantee a consistent, if wrong, result for permutations of the sets).

One of the most obvious solutions would be to employ or implement fixed point arithmetic, using a sufficient number of bits to represent every possible operand value exactly and to keep the intermediate result exact.

Perhaps, however, this can be done using only floating point arithmetic, in a manner that guarantees the correct sign of the result. If so, the problem of overflow (as illustrated in one of the examples above) needs to be addressed in the solution, as this question has a particular technical aspect.

(What follows is original question.)

I have two sets of multiple floating point (float or double) values. I want to provide a perfect answer to the question of which set has the larger sum. Because of artifacts of floating point arithmetic, in some corner cases the result of the naive approach may be wrong, depending on the order of operations. Not to mention that a simple sum can overflow. I can't show any effort on my side, because all I have are vague ideas, all of them complicated and unconvincing.

One possible approach is to compute the sum using a superaccumulator: an algorithm for computing exact sums of floating point numbers. Although these ideas have been around for a while, the term is a relatively new one.

In some sense, you can think of it as an extension of Kahan summation, where the sequential sum is stored as an array of values, rather than just a pair. The main challenge then becomes figuring out how to allocate the precision amongst the various values.
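In a similar spirit, the running sum can be kept as an expansion of non-overlapping floats, in the style of Shewchuk's grow-expansion and of Python's math.fsum. What follows is a minimal sketch, not the algorithm of any one of the papers cited below; it assumes strict IEEE 754 round-to-nearest float arithmetic (no -ffast-math) and that no partial sum overflows to infinity, so the overflow case from the question would still need the larger superaccumulator machinery:

```cpp
#include <cstddef>
#include <vector>

// Error-free transformation (Knuth's two-sum): s + err == a + b exactly.
static void twoSum(float a, float b, float& s, float& err) {
    s = a + b;
    float bb = s - a;
    err = (a - (s - bb)) + (b - bb);
}

// Accumulate xs into a list of non-overlapping partials whose exact sum
// equals the true sum of xs. Components are kept in increasing magnitude.
static std::vector<float> exactPartials(const std::vector<float>& xs) {
    std::vector<float> partials;
    for (float x : xs) {
        std::size_t used = 0;
        for (float p : partials) {
            float hi, err;
            twoSum(x, p, hi, err);
            if (err != 0.0f) partials[used++] = err;  // keep nonzero remainders
            x = hi;                                   // carry the high part upward
        }
        partials.resize(used);
        partials.push_back(x);                        // largest component last
    }
    return partials;
}

// For a non-overlapping expansion, the largest nonzero component dominates
// the sum of all smaller ones, so it carries the sign of the exact sum.
static int signOfSum(const std::vector<float>& xs) {
    std::vector<float> parts = exactPartials(xs);
    for (auto it = parts.rbegin(); it != parts.rend(); ++it)
        if (*it != 0.0f) return *it > 0.0f ? 1 : -1;
    return 0;
}
```

Comparing two sets is then signOfSum of the first set concatenated with the negation of the second; order of the inputs no longer matters, since the partials represent the sum exactly.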

Some relevant papers and code:

  • YK Zhu and WB Hayes. "Algorithm 908: Online Exact Summation of Floating-Point Streams". ACM Transactions on Mathematical Software (ACM TOMS), 37(3):37:1-37:13, September 2010. doi: 10.1145/1824801.1824815

    • Unfortunately the paper and code are behind a paywall, but this appears to be the C++ code .
  • RM Neal, "Fast Exact Summation using Small and Large Superaccumulators". 2015. arXiv: 1505.05571

  • MT Goodrich, A. Eldawy "Parallel Algorithms for Summing Floating-Point Numbers". 2016. arXiv: 1605.05436

The post was originally tagged C as well, and so my code applies to that.
I now see the post is C++ only, yet I see little in the following that would not readily apply to C++.

Simplify to finding the sign of the sum of a list of FP numbers

Comparing 2 sets of numbers is equivalent to appending the negation of the second set to the first and then finding the sign of the sum of the joint list. This sign maps to >, == or < for the 2 original sets.

Perform only exact FP math

Assumption: FP employs IEEE-like numbers, including subnormals, base 2, and is exact for certain operations:

  1. Addition of a + b with the same binary exponent and differing signs.

  2. Subtraction of a same-signed 0.5 from a number in the 0.5 <= |x| < 1.0 range.

  3. frexp*() (break a number into significand and exponent parts) and ldexp*() (scale by a power of 2) return exact values.

Form array per exponent

Form an array of sums sums[] whose values will only ever be 0 or 0.5 <= |sums[i]| < 1.0, one for each possible exponent and for some exponents larger than the max. The larger ones are needed to accumulate a |total_sum| that exceeds FLT_MAX. This needs up to log2(SIZE_MAX) more elements.

Add the set of numbers to sums[]

For each element of the number set, add it to the corresponding sums[] per its binary exponent. This is key as addition of same sign and differing sign FP numbers with a common FP binary exponent can be done exactly . The addition may result in a carry with same sign values and cancellation with differing sign values - this is handled. The incoming set of numbers need not be sorted.

Normalize sums[]

For each element of sums[], ensure any value other than 0.5, 0.0 or -0.5 is reduced to one of those, with the remaining portion added at smaller exponents of sums[].

Inspect sums[] for the most significant digit

The sign of the most significant (non-zero) sums[] entry is the sign of the result.


The below code performs the task using float as the set's FP type. Some parallel calculations are done using double to check for sanity, but do not contribute to the float calculation.

The normalizing step at the end typically iterates twice. Even for a worst-case set, I suspect it would iterate about the binary width of the float significand, about 23 times.

The solution appears to be about O(n) , but does use an array about the size of the FP's exponent range.

#include <assert.h>
#include <stdbool.h>
#include <float.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

#if RAND_MAX/2 >= 0x7FFFFFFFFFFFFFFF
#define LOOP_COUNT 1
#elif RAND_MAX/2 >= 0x7FFFFFFF
#define LOOP_COUNT 2
#elif RAND_MAX/2 >= 0x1FFFFFF
#define LOOP_COUNT 3
#elif RAND_MAX/2 >= 0xFFFF
#define LOOP_COUNT 4
#else
#define LOOP_COUNT 5
#endif

uint64_t rand_uint64(void) {
  uint64_t r = 0;
  for (int i = LOOP_COUNT; i > 0; i--) {
    r = r * (RAND_MAX + (uint64_t) 1u) + ((unsigned) rand());
  }
  return r;
}

typedef float fp1;
typedef double fp2;

fp1 rand_fp1(void) {
  union {
    fp1 f;
    uint64_t u64;
  } u;
  do {
    u.u64 = rand_uint64();
  } while (!isfinite(u.f));
  return u.f;
}

int pre = DBL_DECIMAL_DIG - 1;


void exact_add(fp1 *sums, fp1 x, int expo);

// Add x to sums[expo]
// 0.5 <= |x| < 1
// both same sign.
void exact_fract_add(fp1 *sums, fp1 x, int expo) {
  assert(fabsf(x) >= 0.5 && fabsf(x) < 1.0);
  assert(fabsf(sums[expo]) >= 0.5 && fabsf(sums[expo]) < 1.0);
  assert((sums[expo] > 0.0) == ( x > 0.0));

  fp1 half = x > 0.0 ? 0.5 : -0.5;
  fp1 sum = (sums[expo] - half) + (x - half);
  if (fabsf(sum) >= 0.5) {
    assert(fabsf(sums[expo]) < 1.0);
    sums[expo] = sum;
  } else  {
    sums[expo] = 0.0;
    if (sum) exact_add(sums, sum, expo);
  }
  exact_add(sums, half, expo+1);  // carry
}

// Add  x to sums[expo]
// 0.5 <= |x| < 1
// differing sign
void exact_fract_sub(fp1 *sums, fp1 x, int expo) {
  if(!(fabsf(x) >= 0.5 && fabsf(x) < 1.0)) {
    printf("%d %e\n", __LINE__, x);
    exit(-1);
  }
  assert(fabsf(x) >= 0.5 && fabsf(x) < 1.0);
  assert((sums[expo] > 0.0) != ( x > 0.0));
  fp1 dif = sums[expo] + x;
  sums[expo] = 0.0;
  exact_add(sums, dif, expo);
}

// Add x to sums[]
void exact_add(fp1 *sums, fp1 x, int expo) {
  if (x == 0) return;
  assert (x >= -FLT_MAX && x <= FLT_MAX);
  //while (fabsf(x) >= 1.0) { x /= 2.0; expo++; }
  while (fabsf(x) < 0.5) { x *= (fp1)2.0; expo--; }
  assert(fabsf(x) >= 0.5 && fabsf(x) < 1.0);

  if (sums[expo] == 0.0) {
    sums[expo] = x;
    return;
  }
  if(!(fabsf(sums[expo]) >= 0.5 && fabsf(sums[expo]) < 1.0)) {
    printf("%e\n", sums[expo]);
    printf("%d %e\n", expo, x);
    exit(-1);
  }
  assert(fabsf(sums[expo]) >= 0.5 && fabsf(sums[expo]) < 1.0);
  if ((sums[expo] > 0.0) == (x > 0.0)) {
    exact_fract_add(sums, x, expo);
  } else {
    exact_fract_sub(sums, x, expo);
  }
}

void exact_add_general(fp1 *sums, fp1 x) {
  if (x == 0) return;
  assert (x >= -FLT_MAX && x <= FLT_MAX);
  int expo;
  x = frexpf(x, &expo);
  exact_add(sums, x, expo);
}

void sum_of_sums(const char *s, const fp1 *sums, int expo_min, int expo_max) {
  fp1 sum1 = 0.0;
  fp2 sum2 = 0.0;
  int step = expo_max >= expo_min ? 1 : -1;
  for (int expo = expo_min; expo/step <= expo_max/step; expo += step) {
    sum1 += ldexpf(sums[expo], expo);
    sum2 += ldexp(sums[expo], expo);
  }
  printf("%-20s = %+.*e %+.*e\n", s, pre, sum2, pre, sum1);
}


int test_sum(size_t N) {
  fp1 a[N];
  fp1 sum1 = 0.0;
  fp2 sum2 = 0.0;
  for (size_t i = 0; i < N; i++) {
    a[i] = (fp1) rand_fp1();
    sum1 += a[i];
    sum2 += a[i];
  }
  printf("%-20s = %+.*e %+.*e\n", "initial  sums", pre, sum2, pre, sum1);

  int expo_min;
  int expo_max;
  frexpf(FLT_TRUE_MIN, &expo_min);
  frexpf(FLT_MAX, &expo_max);
  size_t ln2_size = SIZE_MAX;
  while (ln2_size > 0) {
    ln2_size >>= 1;
    expo_max++;
  };
  fp1 sum_memory[expo_max - expo_min + 1];
  memset(sum_memory, 0, sizeof sum_memory);  // set to 0.0 cheat
  fp1 *sums = &sum_memory[-expo_min];

  for (size_t i = 0; i<N; i++)  {
    exact_add_general(sums, a[i]);
  }
  sum_of_sums("post add  sums", sums, expo_min,  expo_max);

  // normalize
  int done;
  do {
    done = 1;
    for (int expo = expo_max; expo >= expo_min; expo--) {
      fp1 x = sums[expo];
      if ((x < -0.5) || (x > 0.5)) {
        //printf("xxx %4d %+.*e ", expo, 2, x);
        done = 0;
        if (x > 0.0) {
          sums[expo] = 0.5;
          exact_add(sums, x - (fp1)0.5, expo);
        } else {
          sums[expo] = -0.5;
          exact_add(sums, x - -(fp1)0.5, expo);
        }
      }
    }
    sum_of_sums("end  sums", sums, expo_min,  expo_max);
  } while (!done);

  for (int expo = expo_max; expo >= expo_min; expo--) {
    if (sums[expo]) {
      return (sums[expo] > 0.0) ? 1 : -1; /* entries are exactly +/-0.5 here */
    }
  }
  return 0;
}

#define ITERATIONS 10000
#define MAX_NUMBERS_PER_SET 10000
int main() {
  unsigned seed = (unsigned) time(NULL);
  seed = 0;
  printf("seed = %u\n", seed);
  srand(seed);

  for (unsigned i = 0; i < ITERATIONS; i++) {
    int cmp = test_sum((size_t)rand() % MAX_NUMBERS_PER_SET + 1);
    printf("Compare %d\n\n", cmp);
    if (cmp == 0) break;
  }
  printf("Success");
  return EXIT_SUCCESS;
}

Infinities and NaN can also be handled, to a degree; I leave that for later.

The floating point number resulting from the summation of 2 floating point numbers is only an approximation. Given i1 and i2 to sum, we can find an approximation of the error in the floating point summation by doing this:

i1 + i2 = i12
i12 - i2 = i~1
i1 - i~1 = iΔ

The closest approximation we could come up with for the summation of n numbers would be to calculate the errors of the n - 1 addition operations, then to sum those n - 1 errors, in turn taking the errors of those n - 2 additions, and so on. You'd repeat this process n - 2 times, or until all the errors have gone to 0.0.
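Concretely, the three steps above can be written out for float (a sketch; this short form recovers the error exactly when |i2| >= |i1|, otherwise the full Knuth two-sum with a symmetric correction term is needed, and it assumes round-to-nearest arithmetic without -ffast-math):

```cpp
// Returns iΔ, the rounding error lost in computing i1 + i2
// (exact when |i2| >= |i1|).
float additionError(float i1, float i2) {
    float i12 = i1 + i2;   // the rounded sum
    float it1 = i12 - i2;  // i~1: what i1 effectively contributed
    return i1 - it1;       // iΔ: the lost low-order part
}
```

For example, additionError(1.0f, 1e8f) is 1.0f: the ulp of float near 1e8 is 8, so the 1.0f vanishes entirely from the rounded sum and reappears as the error term.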

There are a couple things that could be done to drive the error calculations to 0.0 faster:

  1. Use a larger floating point type, for example long double
  2. Sort the list prior to summing so that you're adding small numbers to small numbers and large numbers to large numbers

Now you can make an assessment of how important accuracy is to you. I will tell you that in the general case the computational expense of the above operation is outrageous considering the result you get will still be an approximation .

The generally accepted solution is Kahan summation; it's a happy marriage between speed and precision. Rather than holding the error to the end of the summation, Kahan rolls it into each addition, preventing its value from escalating outside the highest-precision floating point range. Say that we're given vector<long double> i1; we could run Kahan summation on it as follows:

auto c = 0.0L;
const auto sum = accumulate(next(cbegin(i1)), cend(i1), i1.front(), [&](const auto& sum, const auto& input) {
    const auto y = input - c;
    const auto t = sum + y;

    c = t - sum - y;
    return t;
} ) - c;

One possibility for performing this comparison with certainty is to create a class for fixed point arithmetic, of precision equal to that of the types in use and without a limit on absolute value.

It could be a class implementing following public methods:

    FixedPoint(double d);
    ~FixedPoint();

    FixedPoint operator+(const FixedPoint& rhs);
    FixedPoint operator-(const FixedPoint& rhs);
    bool isPositive();

(Every supported floating point type needs separate constructor.)

Depending upon circumstances, the implementation would require a collection of bits of either fixed size (decided at compile time or upon construction) or dynamic size; possibly std::bitset, vector<bool>, or a static or dynamic bool array.

For ease of implementation I would suggest 2's complement encoding.

This is an obvious and very costly solution that would hurt performance if this comparison were at the core of some system. Hopefully there is a better solution.
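To make the idea concrete, here is a minimal sketch of such an accumulator (all names hypothetical). Every finite float is M * 2^p for a 25-bit signed integer M and a bounded p, so the exact sum of any realistic number of floats fits in a few hundred bits; limbs are kept in base 2^32 inside int64_t so that signed carries give the 2's-complement-like behaviour. It assumes arithmetic right shift on negative int64_t (true on mainstream targets, guaranteed since C++20):

```cpp
#include <array>
#include <cmath>
#include <cstdint>

class FixedAccumulator {
    static const int kLimbs = 12;  // 384 bits: covers the float range plus headroom
    static const int kBias = 172;  // maps the smallest float exponent to bit 0
    std::array<std::int64_t, kLimbs> limbs_{};  // base-2^32 digits, signed
public:
    // Add one float exactly: decompose into integer significand and exponent,
    // then add the (at most 55-bit) shifted value into two adjacent limbs.
    void add(float x) {
        if (x == 0.0f) return;
        int e;
        float m = std::frexp(x, &e);              // x = m * 2^e, 0.5 <= |m| < 1
        std::int64_t M =
            static_cast<std::int64_t>(std::ldexp(m, 24));  // exact 24-bit integer
        int p = e - 24 + kBias;                   // bit position of M's weight, >= 0
        std::int64_t v = M * (std::int64_t(1) << (p % 32)); // fits: |v| < 2^55
        std::int64_t hi = v >> 32;                // arithmetic shift = floor division
        limbs_[p / 32] += v - (hi << 32);         // non-negative low 32 bits
        limbs_[p / 32 + 1] += hi;
    }
    // Sign of the exact accumulated sum: propagate carries so every digit but
    // the top lands in [0, 2^32), then the (signed) top limb decides.
    int sign() {
        std::int64_t carry = 0;
        bool low_nonzero = false;
        for (int i = 0; i + 1 < kLimbs; ++i) {
            std::int64_t t = limbs_[i] + carry;
            carry = t >> 32;                      // floor division by 2^32
            limbs_[i] = t - (carry << 32);
            low_nonzero = low_nonzero || limbs_[i] != 0;
        }
        limbs_[kLimbs - 1] += carry;
        if (limbs_[kLimbs - 1] != 0) return limbs_[kLimbs - 1] > 0 ? 1 : -1;
        return low_nonzero ? 1 : 0;
    }
};
```

Comparing two sets is then: add every element of the first set, add the negation of every element of the second, and read off sign(), whose -1, 0 or +1 map to <, == and >. Each add is O(1), so the comparison stays O(n) with a fixed few hundred bits of state, and overflow is impossible by construction.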
