
Is optimisation level -O3 dangerous in g++?

I have heard from various sources (though mostly from a colleague of mine) that compiling with an optimisation level of -O3 in g++ is somehow 'dangerous', and should be avoided in general unless proven to be necessary.

Is this true, and if so, why? Should I just be sticking to -O2?

In the early days of gcc (2.8 etc.), and in the times of egcs and Red Hat's 2.96, -O3 was quite buggy sometimes. But that was over a decade ago, and -O3 is now not much different from the other optimization levels in terms of bugginess.

It does, however, tend to reveal cases where people rely on undefined behavior, because the optimizer relies more strictly on the rules of the language(s), and especially on their corner cases.
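As an illustration of the kind of code this refers to, here is a minimal sketch (the function names are hypothetical, not from the answer). The first function depends on signed integer overflow, which is undefined behavior in C++, so an optimizer is allowed to assume it never happens and may fold the comparison to false. The second expresses the same intent without overflow.

```cpp
#include <cassert>
#include <limits>

// Relies on undefined behavior: x + 1 overflows when x is INT_MAX.
// The compiler may assume that never happens and return false always.
// Shown for contrast only; the self-check below does not call it.
bool next_overflows_ub(int x) {
    return x + 1 < x;
}

// Well-defined alternative: compare against the limit before adding.
bool next_overflows_ok(int x) {
    return x == std::numeric_limits<int>::max();
}

// Small self-check that exercises only the well-defined version.
void check_overflow_example() {
    assert(next_overflows_ok(std::numeric_limits<int>::max()));
    assert(!next_overflows_ok(0));
}
```

Code like the first function can appear to work at a low optimization level and change behavior at a higher one, which is easily mistaken for a compiler bug.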

As a personal note, I have been running production software in the financial sector for many years with -O3 and have not yet encountered a bug that would not have been there had I used -O2.

By popular demand, here is an addition:

-O3, and especially additional flags like -funroll-loops (not enabled by -O3), can sometimes lead to more machine code being generated. Under certain circumstances (e.g. on a CPU with an exceptionally small L1 instruction cache) this can cause a slowdown, because all the code of, say, some inner loop no longer fits into L1I. Generally gcc tries quite hard not to generate that much code, but since it usually optimizes for the generic case, this can happen. Options that are especially prone to this (like loop unrolling) are normally not included in -O3 and are marked accordingly in the manpage. As such, it is generally a good idea to use -O3 for generating fast code, and to fall back to -O2 or -Os (which tries to optimize for code size) only when appropriate (e.g. when a profiler indicates L1I misses).

If you want to take optimization to the extreme, you can use --param in gcc to tweak the costs associated with certain optimizations. Additionally, note that gcc now lets you put attributes on functions that control the optimization settings just for those functions. So when you find you have a problem with -O3 in one function (or want to try out special flags for just that function), you don't need to compile the whole file, or even the whole project, with -O2.
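A minimal sketch of the per-function approach, assuming GCC's `optimize` function attribute is what is meant here. The macro name `OPT_LEVEL_O2` and the function `sum_values` are hypothetical. The macro expands to nothing on other compilers, so the file still builds there.

```cpp
#include <cassert>
#include <cstddef>

// GCC-specific attribute; other compilers get an empty macro.
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_LEVEL_O2 __attribute__((optimize("O2")))
#else
#define OPT_LEVEL_O2
#endif

// Hypothetical function suspected of misbehaving at -O3. It is compiled
// as if with -O2 even when the rest of the file is built with -O3.
OPT_LEVEL_O2
long sum_values(const int* values, std::size_t count) {
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Small self-check of the function's result.
void check_sum_values() {
    const int values[] = {1, 2, 3, 4};
    assert(sum_values(values, 4) == 10);
}
```

Note that the GCC documentation describes the `optimize` attribute as intended for debugging purposes rather than production code, so it is probably best treated as a way to narrow a problem down, not as a permanent fix.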

On the other hand, it seems that care must be taken when using -Ofast, whose documentation states:

-Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard compliant programs.

which makes me conclude that -O3 is intended to be fully standards compliant.
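One concrete example of what "not valid for all standard compliant programs" can mean: -Ofast turns on -ffast-math, which permits the compiler to reassociate floating-point operations. The sketch below (function names are hypothetical) shows two expressions that differ only in grouping and give different results under strict IEEE double arithmetic. A build that treats them as interchangeable can therefore change the program's output.

```cpp
#include <cassert>

// Evaluates (a + b) + c in the order written.
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

// Evaluates a + (b + c) in the order written.
double sum_right(double a, double b, double c) {
    return a + (b + c);
}

// Self-check, valid when compiled WITHOUT -ffast-math / -Ofast:
// 1.0 is too small to survive being added to -1e20 in a double,
// so the two groupings give different answers.
void check_reassociation_example() {
    assert(sum_left(1e20, -1e20, 1.0) == 1.0);
    assert(sum_right(1e20, -1e20, 1.0) == 0.0);
}
```

At -O3 without fast-math flags, the compiler is expected to preserve this difference. With -Ofast it is no longer required to.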

In my somewhat checkered experience, applying -O3 to an entire program almost always makes it slower (relative to -O2), because it turns on aggressive loop unrolling and inlining that make the program no longer fit in the instruction cache. For larger programs, this can also be true for -O2 relative to -Os!

The intended use pattern for -O3 is that, after profiling your program, you manually apply it to a small handful of files containing critical inner loops that actually benefit from these aggressive space-for-speed tradeoffs. Newer versions of GCC have a profile-guided optimization mode that can (IIUC) selectively apply the -O3 optimizations to hot functions, effectively automating this process.

The -O3 option turns on more expensive optimizations, such as function inlining, in addition to all the optimizations of the lower levels -O2 and -O1. The -O3 optimization level may increase the speed of the resulting executable, but can also increase its size. Under some circumstances where these optimizations are not favorable, this option might actually make a program slower.

Yes, O3 is buggier. I'm a compiler developer, and I've identified clear and obvious gcc bugs caused by O3 generating buggy SIMD assembly instructions when building my own software. From what I've seen, most production software ships with O2, which means O3 gets less attention with respect to testing and bug fixes.

Think of it this way: O3 adds more transformations on top of O2, which adds more transformations on top of O1. Statistically speaking, more transformations means more bugs. That's true for any compiler.

Recently I experienced a problem using optimization with g++. The problem was related to a PCI card whose registers (for command and data) were represented by a memory address. My driver mapped the physical address to a pointer within the application and gave it to the called process, which worked with it like this:

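unsigned int * pciMemory;               // plain pointer, not declared volatile
askDriverForMapping( & pciMemory );     // driver maps the card's registers here
...
// Every write below targets the same register, at index 0.
pciMemory[ 0 ] = someCommandIdx;
pciMemory[ 0 ] = someCommandLength;
for ( int i = 0; i < sizeof( someCommand ); i++ )
    pciMemory[ 0 ] = someCommand[ i ];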

The card didn't act as expected. When I looked at the generated assembly, I saw that the compiler wrote only the last element of someCommand into pciMemory, omitting all preceding writes.
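The answer does not show a fix, but the usual explanation for this behavior is that the pointer is not `volatile`. Without it, the compiler may treat repeated stores to the same address as dead stores and keep only the last one. A minimal sketch of a `volatile` version follows. The function name `send_command` and its parameters are hypothetical, and it assumes the command is an array of `unsigned int` words whose count is passed explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Writes a command to a single memory-mapped register, one word at a time.
// 'volatile' makes every store an observable side effect, so the compiler
// may not merge or drop any of them.
void send_command(volatile unsigned int* reg,
                  unsigned int command_idx,
                  const unsigned int* command,
                  std::size_t word_count) {
    reg[0] = command_idx;
    reg[0] = static_cast<unsigned int>(word_count);
    for (std::size_t i = 0; i < word_count; ++i) {
        reg[0] = command[i];
    }
}

// Self-check using an ordinary variable as a stand-in for the register.
// It can only observe the final value, not the intermediate writes.
void check_send_command() {
    volatile unsigned int fake_reg = 0;
    const unsigned int cmd[] = {7u, 8u, 9u};
    send_command(&fake_reg, 1u, cmd, 3);
    assert(fake_reg == 9u);
}
```

Real device access may need more than this (for example memory barriers or platform-specific I/O accessors), so treat the sketch as showing the `volatile` point only.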

In conclusion: be accurate and attentive with optimization.

