
G++ optimization beyond -O3/-Ofast

The Problem

We have a mid-sized program for a simulation task that we need to optimize. We have already done our best optimizing the source to the limit of our programming skills, including profiling with Gprof and Valgrind.

When it is finally finished, we want to run the program on several systems, probably for some months. Therefore we are really interested in pushing the optimization to the limits.

All systems will run Debian/Linux on relatively new hardware (Intel i5 or i7).

The Question

What are possible optimization options, using a recent version of g++, that go beyond -O3/-Ofast?

We are also interested in costly minor optimizations that will pay off in the long run.

What we use right now

Right now we use the following g++ optimization options:

  • -Ofast: the highest "standard" optimization level. The included -ffast-math did not cause any problems in our calculations, so we decided to go for it, despite the non-compliance with the standard.
  • -march=native: enables the use of all CPU-specific instructions.
  • -flto: allows link-time optimization across different compilation units.

Most of the answers suggest alternative solutions, such as different compilers or external libraries, which would most likely bring a lot of rewriting or integration work. I will try to stick to what the question is asking and focus on what can be done with GCC alone, by activating compiler flags or making minimal changes to the code, as requested by the OP. This is not a "you must do this" answer, but rather a collection of GCC tweaks that have worked out well for me and that you can try if they are relevant in your specific context.


Warnings regarding the original question

Before going into the details, a few warnings regarding the question, typically for people who will come along, read the question, and say "the OP is optimising beyond O3, so I should use the same flags as he does!".

  • -march=native enables the use of instructions specific to a given CPU architecture, which are not necessarily available on a different architecture. The program may not work at all if run on a system with a different CPU, or may be significantly slower (as this also enables mtune=native), so be aware of this if you decide to use it. More information here.
  • -Ofast, as you stated, enables some optimisations that are not standard compliant, so it should be used with caution as well. More information here.

Other GCC flags to try out

The details for the different flags are listed here.

  • -Ofast enables -ffast-math, which in turn enables -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. You can go even further on floating-point calculation optimisations by selectively adding some extra flags such as -fno-signed-zeros, -fno-trapping-math and others. These are not included in -Ofast and can give some additional performance increases on calculations, but you must check whether they actually benefit you and don't break any calculations.
  • GCC also features a large number of other optimisation flags which aren't enabled by any "-O" option. They are listed as "experimental options that may produce broken code", so again, they should be used with caution, and their effects checked both by testing for correctness and by benchmarking. Nevertheless, I do often use -frename-registers; this option has never produced unwanted results for me and tends to give a noticeable performance increase (i.e. one that can be measured when benchmarking). This is the type of flag that is very dependent on your processor, though. -funroll-loops also sometimes gives good results (and also implies -frename-registers), but it depends on your actual code.

PGO

GCC has Profile-Guided Optimisation (PGO) features. There isn't a lot of precise GCC documentation about them, but nevertheless getting PGO to run is quite straightforward:

  • First compile your program with -fprofile-generate.
  • Let the program run (the execution time will be significantly slower, as the code is also writing profile information into .gcda files).
  • Recompile the program with -fprofile-use. If your application is multi-threaded, also add the -fprofile-correction flag.

PGO with GCC can give amazing results and really boost performance significantly (I've seen a 15-20% speed increase on one of the projects I was recently working on). Obviously, the issue here is to have data that is sufficiently representative of your application's execution, which is not always available or easy to obtain.

GCC's Parallel Mode

GCC features a Parallel Mode, which was first released around the time the GCC 4.2 compiler came out.

Basically, it provides you with parallel implementations of many of the algorithms in the C++ Standard Library. To enable them globally, you just have to add the -fopenmp and -D_GLIBCXX_PARALLEL flags to the compiler. You can also selectively enable each algorithm when needed, but this will require some minor code changes.

All the information about this parallel mode can be found here.

If you frequently use these algorithms on large data structures and have many hardware thread contexts available, these parallel implementations can give a huge performance boost. I have only made use of the parallel implementation of sort so far, but to give a rough idea, I managed to reduce the sorting time from 14 to 4 seconds in one of my applications (testing environment: a vector of 100 million objects with a custom comparator function, on an 8-core machine).

Extra tricks

Unlike the previous sections, this part does require some small changes in the code. They are also GCC specific (some of them work on Clang as well), so compile-time macros should be used to keep the code portable across other compilers. This section contains some more advanced techniques, and should not be used if you don't have some assembly-level understanding of what's going on. Also note that processors and compilers are pretty smart nowadays, so it may be tricky to get any noticeable benefit from the functions described here.

  • GCC builtins, which are listed here. Constructs such as __builtin_expect can help the compiler do better optimisations by providing it with branch-prediction information. Other constructs such as __builtin_prefetch bring data into a cache before it is accessed, and can help reduce cache misses.
  • Function attributes, which are listed here. In particular, you should look into the hot and cold attributes: the former indicates to the compiler that the function is a hotspot of the program, optimises it more aggressively, and places it in a special subsection of the text section, for better locality; the latter optimises the function for size and places it in another special subsection of the text section.

I hope this answer proves useful for some developers, and I will be glad to consider any edits or suggestions.

relatively new hardware (Intel i5 or i7)

Why not invest in a copy of the Intel compiler and high-performance libraries? It can outperform GCC on optimizations by a significant margin, typically 10% to 30% or even more, and even more so for heavy number-crunching programs. Intel also provides a number of extensions and libraries for high-performance number-crunching (parallel) applications, if that's something you can afford to integrate into your code. It might pay off big if it ends up saving you months of running time.

We have already done our best optimizing the source to the limit of our programming skills

In my experience, the kind of micro- and nano-optimizations that you typically do with the help of a profiler tend to have a poor return on time investment compared to macro-optimizations (streamlining the structure of the code) and, most importantly and often overlooked, memory-access optimizations (e.g. locality of reference, in-order traversal, minimizing indirection, weeding out cache misses, etc.). The latter usually involve designing the memory structures to better reflect the way the memory is used (traversed). Sometimes it can be as simple as switching a container type and getting a huge performance boost from that. Often, with profilers, you get lost in the details of instruction-by-instruction optimizations, while memory-layout issues don't show up and are usually missed when you forget to look at the bigger picture. It's a much better way to invest your time, and the payoffs can be huge (e.g. many O(logN) algorithms end up performing almost as slowly as O(N) just because of poor memory layouts; a linked list or linked tree is a typical culprit of huge performance problems compared to a contiguous storage strategy).

If you can afford it, try VTune. It provides MUCH more info than simple sampling (provided by gprof, as far as I know). You might also give the Code Analyst a try. The latter is a decent piece of free software, but it might not work correctly (or at all) with Intel CPUs.

Being equipped with such a tool allows you to check various measures such as cache utilization (and basically memory layout), which, used to its full extent, can provide a huge boost to efficiency.

When you are sure that your algorithms and structures are optimal, then you should definitely use the multiple cores on the i5 and i7. In other words, play around with different parallel-programming algorithms/patterns and see if you can get a speed-up.

When you have truly parallel data (array-like structures on which you perform similar/same operations), you should give OpenCL and SIMD instructions (easier to set up) a try.

Huh, then one final thing you may try: the ACOVEA project (Analysis of Compiler Optimizations via an Evolutionary Algorithm). As is obvious from the description, it uses a genetic algorithm to pick the best compiler options for your project (compiling many, many times and checking the timing, which gives feedback to the algorithm), and the results can be impressive! :)

Some notes about the currently chosen answer (I do not have enough reputation points yet to post this as a comment):

The answer says:

-fassociative-math, -freciprocal-math, -fno-signed-zeros, and -fno-trapping-math. These are not included in -Ofast and can give some additional performance increases on calculations

Perhaps this was true when the answer was posted, but the GCC documentation says that all of these are enabled by -funsafe-math-optimizations, which is enabled by -ffast-math, which is enabled by -Ofast. This can be checked with the command gcc -c -Q -Ofast --help=optimizer, which shows which optimizations are enabled by -Ofast, and confirms that all of these are enabled.

The answer also says:

other optimisation flags which aren't enabled by any "-O" options... -frename-registers

Again, the above command shows that, at least with my GCC 5.4.0, -frename-registers is enabled by default with -Ofast.

It is difficult to answer without further detail:

  • What type of number crunching?
  • What libraries are you using?
  • What degree of parallelization?

Can you post the part of your code that takes the longest? (Typically a tight loop.)

If you are CPU bound, the answer will be different than if you are IO bound.

Again, please provide further detail.

I would recommend taking a look at the types of operations that constitute the heavy lifting, and looking for an optimized library. There are quite a lot of fast, assembly-optimized, SIMD-vectorized libraries out there for common problems (mostly math). Reinventing the wheel is often tempting, but it is usually not worth the effort if an existing solution can cover your needs. Since you have not stated what sort of simulation it is, I can only provide some examples:

http://www.yeppp.info/ http://www.yeppp.info/

http://eigen.tuxfamily.org/index.php?title=Main_Page http://eigen.tuxfamily.org/index.php?title=Main_Page

https://github.com/xianyi/OpenBLAS https://github.com/xianyi/OpenBLAS

With gcc on Intel, try switching to / enabling -fno-gcse (works well with gfortran) and -fno-guess-branch-probability (the default in gfortran).
