How well does the Visual C++ 2008/2010 compiler optimize?

I'm just wondering how well the MSVC++ compiler can optimize code (with code examples), or what it can't optimize and why.

For example, I used the SSE intrinsics with something like this (var is an __m128 value; it was for a frustum-culling test):

if( var.m128_f32[0] > 0.0f && var.m128_f32[1] > 0.0f && var.m128_f32[2] > 0.0f && var.m128_f32[3] > 0.0f ) {
    ...
}

When I looked at the asm output, I saw that it compiled to an ugly, very jumpy version (and I know that CPUs just hate tight jumps). I also know that I can optimize it with the SSE4.1 PTEST instruction, but why did the compiler not do it (even though the compiler writers defined the PTEST intrinsic, so they knew about the instruction)?
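For comparison, the four-way test can be written by hand without any per-lane jumps. The following is only a minimal sketch, assuming an x86 target with plain SSE available; it uses a compare plus `_mm_movemask_ps` rather than PTEST, and `all_lanes_positive` is a hypothetical helper name that does not come from the original code:

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_cmpgt_ps, _mm_movemask_ps

// Hypothetical helper (name is an assumption, not from the original post).
// Returns true when all four lanes of v are strictly greater than 0.0f.
inline bool all_lanes_positive(__m128 v)
{
    // Each lane becomes all-ones where v > 0.0f and all-zeros otherwise.
    const __m128 gt = _mm_cmpgt_ps(v, _mm_setzero_ps());

    // Gather the top bit of every lane into the low four bits of an int;
    // 0xF means every lane passed the comparison.
    return _mm_movemask_ps(gt) == 0xF;
}
```

The condition then reads `if (all_lanes_positive(var)) { ... }`, which leaves a single branch on the combined mask instead of one branch per lane.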

What other optimizations can't it do (so far)?

Does this imply that with today's technology I'm forced to use intrinsics, inline ASM and linked ASM functions, and will compilers ever find such things (I don't think so)?

Where can I read more about how well the MSVC++ compiler optimizes?

(Edit 1): I used the SSE2 switch and the FP:fast switch.

By default the compiler generates code that will run on a 'lowest common denominator' CPU, i.e. one without SSE 4.1 instructions.

You can change that by targeting only later CPUs in the build options.

That said, the MS compiler is traditionally 'not the best' when it comes to SSE optimisation. I'm not even sure if it supports SSE 4 at all. The linked article gives good credit to GCC for SSE optimisation:

As a side note about GCC's near perfection in code generation – I was quite surprised seeing it surpass even Intel's own compiler

Perhaps you need to change compiler!

You might want to try Intel's ICC compiler. In my experience it generates a lot better code than Visual C++, especially for SSE code. You can get a free 30-day evaluation license from intel.com.

You can turn on the asm view of the compiled code and see for yourself what is generated.

Check the presentation at http://lambda-the-ultimate.org/node/3674

Summary: Compilers generally do lots of amazing tricks now, even things that don't seem to be related to imperative programming, like tail-call optimization. MSVC++ is not the best, but it still seems pretty good.

If-statements generate conditional jumps unless conditional moves can be used, but that is more likely something done in hand-written assembly. There are rules that govern the CPU's conditional jump assumptions (branch prediction), such that the penalty of a conditional jump which behaves according to those rules is acceptable. Then there is out-of-order execution to additionally complicate things :). The bottom line is that if your code is straightforward, the jumps which eventually occur won't mess up performance. You might check out Agner Fog's optimization pages.

A non-debug compilation of your C code specifically should generate four conditional jumps. The logical ANDs (&&) are evaluated left to right, so one C-level optimization could be to test the f32 that is most likely to be > 0.0f first (if such a probability can be determined). You have five possible execution variants, built from outcomes such as test1 true, branch taken (t1tbt), test1 false, no branch (t1fnb), test2 true, branch taken (t2tbt), etc. Here a test is "true" when the element is <= 0.0f, so the jump past the if-body is taken. This gives the following possible sequences:

t1tbt                      ; var.m128_f32[0] <= 0.0f
t1fnb t2tbt                ; var.m128_f32[0] >  0.0f, var.m128_f32[1] <= 0.0f
t1fnb t2fnb t3tbt          ; var.m128_f32[0] >  0.0f, var.m128_f32[1] >  0.0f,
                           ; var.m128_f32[2] <= 0.0f
t1fnb t2fnb t3fnb t4tbt    ; var.m128_f32[0] >  0.0f, var.m128_f32[1] >  0.0f,
                           ; var.m128_f32[2] >  0.0f, var.m128_f32[3] <= 0.0f
t1fnb t2fnb t3fnb t4fnb    ; var.m128_f32[0] >  0.0f, var.m128_f32[1] >  0.0f
                           ; var.m128_f32[2] >  0.0f, var.m128_f32[3] >  0.0f
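The five sequences above follow directly from the short-circuit rule of &&. As a rough illustration only, the sketch below counts how many of the four comparisons actually run for a given input; `comparisons_evaluated` is a hypothetical helper introduced for this example, and a plain array of four floats stands in for `var.m128_f32`:

```cpp
#include <array>

// Hypothetical helper (not from the original post): returns how many of the
// four comparisons are evaluated before the && chain is decided.
inline int comparisons_evaluated(const std::array<float, 4>& f)
{
    int count = 0;

    // The comma operator bumps the counter, then yields the comparison.
    // && stops at the first comparison that is false, exactly like the
    // conditional jumps in the generated code.
    const bool all_positive =
        (++count, f[0] > 0.0f) &&
        (++count, f[1] > 0.0f) &&
        (++count, f[2] > 0.0f) &&
        (++count, f[3] > 0.0f);

    (void)all_positive;  // only the count matters in this sketch
    return count;
}
```

An input whose first element is not positive stops after one comparison (the first sequence), while the last two sequences both run all four.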

Only a taken branch will disrupt the pipeline (most severely when it is mispredicted), and branch prediction will minimize the disruption as much as possible.

Assuming floats are expensive to test (they are), if var is a union and you are well-versed in floating-point ins and outs, you might consider doing integer testing on the overlapping types. For example, the stored value 1.0f occupies four bytes stored as 0x00, 0x00, 0x80, 0x3f (x86/little-endian). Reading this value as a long integer will give 0x3f800000 or +1065353216. 0.0f is 0x00, 0x00, 0x00, 0x00 or 0x00000000 (long). Negative float values have exactly the same format as positive ones, with the exception that the highest bit is set (0x80000000).
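The bit patterns above can be checked with a small sketch like the following. It assumes `float` is a 32-bit IEEE 754 value, and `float_bits` and `is_positive_by_bits` are hypothetical helper names. It copies the bytes with `std::memcpy` instead of reading through a union, which is a substitution made here to stay within standard C++:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the raw 32-bit pattern of a float.
inline std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined byte copy
    return bits;
}

// Hypothetical helper: integer-only stand-in for (f > 0.0f).
// Caveat: a NaN with a clear sign bit also passes this test,
// whereas (NaN > 0.0f) is false.
inline bool is_positive_by_bits(float f)
{
    const std::uint32_t bits = float_bits(f);
    // Sign bit clear and not +0.0f (all bits zero).
    return (bits & 0x80000000u) == 0 && bits != 0;
}
```

With these, `float_bits(1.0f)` yields 0x3f800000 as described above, and both 0.0f and -0.0f are rejected by `is_positive_by_bits`.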
