How well does the Visual C++ 2008/2010 compiler optimize?

Question

I'm just wondering how well the MSVC++ compiler can optimize code (with code examples), and what it can't optimize and why.

For example, I used the SSE intrinsics with something like this (var is an __m128 value; it was for a frustum-culling test):

if( var.m128_f32[0] > 0.0f && var.m128_f32[1] > 0.0f && var.m128_f32[2] > 0.0f && var.m128_f32[3] > 0.0f ) {
    ...
}

When I looked at the asm output, I saw that it compiled to an ugly, very jumpy version (and I know that CPUs hate tight jumps). I also know that I can optimize it with the SSE4.1 PTEST instruction, but why did the compiler not do that (even though the compiler writers defined the PTEST intrinsic, so they knew the instruction)?

What other optimizations can't it do (so far)?

Does this imply that, with today's technology, I'm forced to use intrinsics, inline ASM, and linked ASM functions? Will compilers ever find such things (I don't think so)?

Where can I read more about how well the MSVC++ compiler optimizes?

(Edit 1): I used the SSE2 switch (/arch:SSE2) and the fast floating-point switch (/fp:fast).

Answer 1

The default for the compiler is set to generate code that will run on a 'lowest common denominator' CPU, i.e. one without SSE 4.1 instructions.

You can change that by targeting later CPUs only in the build options.

That said, the MS compiler is traditionally 'not the best' when it comes to SSE optimisation. I'm not even sure whether it supports SSE 4 at all. That link gives good credit to GCC for SSE optimisation:

As a side note about GCC's near perfection in code generation – I was quite surprised seeing it surpass even Intel's own compiler

Perhaps you need to change compiler!

Answer 2

You might want to try Intel's ICC compiler. In my experience it generates much better code than Visual C++, especially for SSE code. You can get a free 30-day evaluation license from intel.com.

Answer 3

You can enable the asm view of the compiled code and see for yourself what gets generated.

Answer 4

Check the presentation at http://lambda-the-ultimate.org/node/3674

Summary: Compilers generally do lots of amazing tricks now, even things that don't seem to be related to imperative programming, like tail-call optimization. MSVC++ is not the best, but it still seems pretty good.

Answer 5

If-statements generate conditional jumps unless you can use conditional moves, but that is more likely something done in hand-written assembly. There are rules that govern the CPU's conditional jump assumptions (branch prediction), such that the penalty of a conditional jump which behaves according to those rules is acceptable. Then there is out-of-order execution to additionally complicate things :). The bottom line is that if your code is straightforward, the jumps that eventually occur won't mess up performance. You might check out Agner Fog's optimization pages.

A non-debug compilation of your C code specifically should generate four conditional jumps. The logical ANDs (&&) and the use of parentheses result in left-to-right testing, so one C optimization could be to test the f32 that is most likely to be >0.0f first (if such a probability can be determined). You have five possible execution variants, built from test1 true, branch taken (t1tbt); test1 false, no branch (t1fnb); test2 true, branch taken (t2tbt); and so on. Here each "test" is the compiled jump condition, which is true when the element is <= 0.0f. This gives the following possible sequences:

t1tbt                      ; var.m128_f32[0] <= 0.0f
t1fnb t2tbt                ; var.m128_f32[0] >  0.0f, var.m128_f32[1] <= 0.0f
t1fnb t2fnb t3tbt          ; var.m128_f32[0] >  0.0f, var.m128_f32[1] >  0.0f,
                           ; var.m128_f32[2] <= 0.0f
t1fnb t2fnb t3fnb t4tbt    ; var.m128_f32[0] >  0.0f, var.m128_f32[1] >  0.0f,
                           ; var.m128_f32[2] >  0.0f, var.m128_f32[3] <= 0.0f
t1fnb t2fnb t3fnb t4fnb    ; var.m128_f32[0] >  0.0f, var.m128_f32[1] >  0.0f,
                           ; var.m128_f32[2] >  0.0f, var.m128_f32[3] >  0.0f
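The sequences above can be reproduced by counting how many comparisons actually run before && short-circuits. This is only a sketch of the language-level behaviour, not of the generated assembly; the function name tests_evaluated and its four float parameters are hypothetical and stand in for var.m128_f32[0..3].

```cpp
// Hypothetical helper: returns how many of the four "> 0.0f" tests are
// evaluated before the && chain stops. The comma operator bumps the
// counter just before each comparison is performed.
int tests_evaluated(float a, float b, float c, float d) {
    int count = 0;
    const bool all = (++count, a > 0.0f) && (++count, b > 0.0f) &&
                     (++count, c > 0.0f) && (++count, d > 0.0f);
    (void)all;  // only the number of evaluated tests matters here
    return count;
}
```

The last two sequences in the list both evaluate all four tests; they differ only in whether the final branch is taken.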

Only a taken branch will result in a pipeline disruption, and branch prediction will minimize the disruption as much as possible.
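For comparison with the four-jump version, the test from the question can also be written so that the source contains a single comparison result to branch on, using plain SSE intrinsics rather than SSE4.1 PTEST. This is a minimal sketch, assuming an x86 target with SSE enabled; the helper name all_positive is hypothetical, and whether it is actually faster for a given workload would have to be measured.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_cmpgt_ps, _mm_movemask_ps

// Hypothetical helper: true when all four lanes of var are > 0.0f.
inline bool all_positive(__m128 var) {
    // Each lane becomes all-ones where var > 0.0f and all-zeros otherwise
    // (a NaN lane compares false, as with the scalar > operator).
    const __m128 gt = _mm_cmpgt_ps(var, _mm_setzero_ps());
    // movemask packs the sign bit of each lane into bits 0..3 of an int,
    // so 0xF means every lane passed the comparison.
    return _mm_movemask_ps(gt) == 0xF;
}
```

Usage would look like if( all_positive(var) ) { ... } in place of the four chained scalar comparisons.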

Assuming floats are expensive to test (they are), if var is a union and you are well versed in floating-point ins and outs, you might consider doing integer testing on the overlapping types. For example, the stored value 1.0f occupies four bytes, stored as 0x00, 0x00, 0x80, 0x3f (x86/little-endian). Reading this value as a long integer gives 0x3f800000, or +1065353216. 0.0f is 0x00, 0x00, 0x00, 0x00, or 0x00000000 (long). Negative float values have exactly the same format as positive ones, with the exception that the highest bit is set (0x80000000).
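The bit patterns above can be checked, and the integer test sketched, roughly as follows. This assumes 32-bit IEEE-754 floats; the names float_bits and is_positive_via_int are hypothetical. It copies the bytes with std::memcpy instead of reading through a union, and it is only equivalent to x > 0.0f when x is not a NaN (a NaN with a clear sign bit would pass the integer test).

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == 4 && std::numeric_limits<float>::is_iec559,
              "sketch assumes 32-bit IEEE-754 floats");

// Hypothetical helper: returns the raw 32-bit pattern of a float.
inline std::uint32_t float_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to reinterpret
    return bits;
}

// Hypothetical helper: integer version of "x > 0.0f" for non-NaN inputs.
// Positive floats read as positive signed integers, +0.0f reads as 0, and
// anything with the sign bit set (including -0.0f) reads as negative.
inline bool is_positive_via_int(float x) {
    std::int32_t as_int;
    std::memcpy(&as_int, &x, sizeof as_int);
    return as_int > 0;
}
```

With these helpers, float_bits(1.0f) gives 0x3f800000 and float_bits(0.0f) gives 0, matching the values described above.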

5 answers

solution1
4 ACCEPTED 2010-07-14 23:04:05

solution2
2 2010-07-14 22:28:46

solution3
1 2010-07-14 22:29:40

solution4
0 2010-07-14 22:46:44

solution5
0 2011-04-14 08:23:32
