
I tried: valgrind, _GLIBCXX_DEBUG, -fno-strict-aliasing; how do I debug this error?

I have a really strange error that I've spent several days trying to figure out, and now I want to see if anybody has any comments to help me understand what's happening.

Some background: I'm working on a software project which involves adding C++ extensions to Python 2.7.1 using Boost 1.45, so all my code is run through the Python interpreter. Recently, I made a change to the code which broke one of our regression tests. This regression test is probably too sensitive to numerical fluctuations (e.g. different machines), so I should fix that. However, since this regression is breaking on the same machine/compiler that produced the original regression results, I traced the difference in results to this snippet of numerical code (which is verifiably unrelated to the code I changed):

c[3] = 0.25 * (-3 * df[i-1] - 23 * df[i] - 13 * df[i+1] - df[i+2]
               - 12 * f[i-1] - 12 * f[i] + 20 * f[i+1] + 4 * f[i+2]);
printf("%2li %23a : %23a %23a %23a %23a : %23a %23a %23a %23a\n",i,
       c[3],
       df[i-1],df[i],df[i+1],df[i+2],f[i-1],f[i],f[i+1],f[i+2]);

which constructs some numerical tables. Note that:

  • %a prints an exact ASCII (hexadecimal) representation of the value
  • The left hand side (lhs) is c[3], and the rhs is the other 8 values.
  • The output below was for values of i that were far from the boundaries of f, df
  • this code sits inside a loop over i, which is itself nested several layers deep (so I'm unable to provide an isolated test case to reproduce this).

So I cloned my source tree, and the only difference between the two executables I compile is that the clone includes some extra code which isn't even executed in this test. This makes me suspect that it must be a memory problem, since the only difference should be where the code exists in memory... Anyway, when I run the two executables, here's the difference in what they produce:

diff new.out old.out 
655,656c655,656
<  6  -0x1.7c2a5a75fc046p-10 :                  0x0p+0                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4
<  7   -0x1.a18f0b3a3eb8p-10 :                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7   -0x1.a4acc49fef001p-6 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4    0x1.9f6a9bc4559cdp-5
---
>  6  -0x1.7c2a5a75fc006p-10 :                  0x0p+0                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4
>  7  -0x1.a18f0b3a3ec5cp-10 :                  0x0p+0                  0x0p+0   -0x1.75eee7aa9b8ddp-7   -0x1.a4acc49fef001p-6 :    0x1.304ec13281eccp-4    0x1.304ec13281eccp-4    0x1.1eaea08b55205p-4    0x1.9f6a9bc4559cdp-5
<more output truncated>

You can see that the value in c[3] is subtly different, while none of the rhs values differ. So somehow identical input is giving rise to different output. I tried simplifying the rhs expression, but any change I make eliminates the difference. If I print &c[3], the difference goes away. If I run on two different machines (Linux, OS X) I have access to, there's no difference. Here's what I've already tried:

  • valgrind (reported numerous problems in python, but nothing in my code, and nothing that looked serious)
  • -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_ASSERT -D_GLIBCXX_DEBUG_PEDASSERT -D_GLIBCXX_DEBUG_VERIFY (but nothing asserts)
  • -fno-strict-aliasing (but I do get aliasing compile warnings out of the boost code)

I tried switching from gcc 4.1.2 to gcc 4.5.2 on the machine that has the problem, and this specific, isolated difference goes away (but the regression still fails, so let's assume that's a different problem).

Is there anything I can do to isolate the problem further? For future reference, is there any way to analyze or understand this kind of problem quicker? For example, given my description of lhs changing even though rhs is not, what would you conclude?

EDIT: The problem was entirely due to -ffast-math.

You can change the floating-point type used by your program. If you use float, you can switch to double; if c, f, df are double, you can switch to long double (80-bit on Intel; 128-bit on SPARC). With GCC 4.5.2 you can even try __float128, a 128-bit software-emulated type.

The rounding error will be smaller with a longer floating-point type.
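As a rough illustration (my own sketch, not code from the question; the coefficient formula mirrors the c[3] expression above, but the input values are arbitrary placeholders), you can evaluate the same expression at several precisions and compare the exact representations:

#include <cstdio>

// Evaluate the c[3]-style stencil at precision T.
template <typename T>
T sample_coeff(const T* df, const T* f, int i) {
    return T(0.25) * (T(-3) * df[i-1] - T(23) * df[i] - T(13) * df[i+1] - df[i+2]
                      - T(12) * f[i-1] - T(12) * f[i] + T(20) * f[i+1] + T(4) * f[i+2]);
}

int main() {
    const double df[4] = {0.0, 0.0, -0.0114, -0.0257};   // placeholder data
    const double f[4]  = {0.0743, 0.0743, 0.0699, 0.0507};

    float       df_f[4], f_f[4];
    long double df_l[4], f_l[4];
    for (int k = 0; k < 4; ++k) {
        df_f[k] = (float)df[k];   f_f[k] = (float)f[k];
        df_l[k] = df[k];          f_l[k] = f[k];
    }

    printf("float       : %a\n",  (double)sample_coeff(df_f, f_f, 1));
    printf("double      : %a\n",  sample_coeff(df, f, 1));
    printf("long double : %La\n", sample_coeff(df_l, f_l, 1));
    return 0;
}

Comparing how many digits survive between float, double and long double gives a feel for how sensitive the expression is to working precision.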

Why does adding some code (even unexecuted code) change the result? GCC may compile the program differently if the code size changes. There are a lot of heuristics inside GCC, and some of them are based on function sizes, so GCC may compile your function in a different way.

Also, try compiling your project with -mfpmath=sse -msse2, because x87 (the default fpmath for older gcc) has known accuracy problems; a small probe after the quoted note shows how to check whether this affects you: http://gcc.gnu.org/wiki/x87note

by default x87 arithmetic is not true 64/32 bit IEEE
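For instance (a sketch of my own, not from the original post), the following probe checks whether double expressions get x87 excess precision, and multiplies two values whose exact product is a known double-rounding case, so the last bit of the result may differ between the two builds:

// Build twice and compare the output:
//   g++ -O2 -mfpmath=387 fpmath_probe.cpp -o probe387 && ./probe387
//   g++ -O2 -mfpmath=sse -msse2 fpmath_probe.cpp -o probesse && ./probesse
#include <cfloat>
#include <cstdio>

int main() {
#ifdef FLT_EVAL_METHOD
    // 2: float/double expressions are evaluated in long double (x87 80-bit)
    // 0: expressions are evaluated in their own type (SSE behaviour)
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
#endif
    // The exact product needs more than 53 bits, so rounding it first to the
    // 64-bit x87 mantissa and then to 53 bits on the store to memory can end
    // up one ulp away from a single direct rounding to double.
    volatile double x = 1848874847.0;    // exactly representable
    volatile double y = 19954562207.0;   // exactly representable
    volatile double z = x * y;
    printf("x*y = %a\n", z);
    return 0;
}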

PS: you should not use -ffast-math-like options when you are interested in stable numeric results: http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Optimize-Options.html

-ffast-math Sets -fno-math-errno, -funsafe-math-optimizations, -fno-trapping-math, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.

This option causes the preprocessor macro __FAST_MATH__ to be defined.

This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

This part of -ffast-math may change results:

-funsafe-math-optimizations Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.

This part hides traps and NaN-related errors from the user (sometimes you want to receive all traps exactly, in order to debug your code):

-fno-trapping-math Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option implies -fno-signaling-nans. Setting this option may allow faster code if one relies on “non-stop” IEEE arithmetic, for example.

This part of -ffast-math says that the compiler can assume the default rounding mode everywhere (which may be false for some programs):

-fno-rounding-math Enable transformations and optimizations that assume default floating point rounding behavior. This is round-to-zero for all floating point to integer conversions, and round-to-nearest for all other arithmetic truncations. ... This option enables constant folding of floating point expressions at compile-time (which may be affected by rounding mode) and arithmetic transformations that are unsafe in the presence of sign-dependent rounding modes.
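To see the effect directly, here is a small demo of my own (not from the original post); the __FAST_MATH__ check is reliable, while the reassociation below depends on the GCC version and optimization level, so the numeric difference may or may not reproduce:

// Build twice and compare:
//   g++ -O2 fastmath_demo.cpp -o strict && ./strict
//   g++ -O2 -ffast-math fastmath_demo.cpp -o fast && ./fast
#include <cstdio>

int main() {
#ifdef __FAST_MATH__
    puts("built with -ffast-math");
#else
    puts("built with strict IEEE semantics");
#endif
    // With strict semantics the written evaluation order is kept:
    // (big + small) loses the 1.0, so the result is 0.0.  With
    // -funsafe-math-optimizations (part of -ffast-math) the compiler may
    // reassociate to (big - big) + small, which gives 1.0 instead.
    volatile double big = 1.0e16, small = 1.0;
    double r = (big + small) - big;
    printf("(big + small) - big = %g (%a)\n", r, r);
    return 0;
}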
