
What optimizations should be left for the compiler?

Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:

v[i*3+0], v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i, and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:

inline double sqr(double x)
{
    return x*x;
}

/* mass is assumed to be defined elsewhere, e.g. at file scope */
double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;

    for (int i=0; i < n; i++)
        sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);

    return 0.5 * mass * sum;
}

To reduce the number of multiplications, it can be written as:

double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;

    for (int i=0; i < n; i++)
    {
        double *w = v + i*3;
        sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
    }

    return 0.5 * mass * sum;
}

(one may write a function with even fewer multiplications, but that's not the point of this question)

Now my question is: since many C compilers can do this kind of optimization automatically, where should the developer rely on the compiler and where should she/he try to do some optimization manually?

"where should the developer rely on the compiler and where should she/he try to do some optimization manually?"

  1. Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If not, forget about manual optimizations.

  2. Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.

  3. When I found this bottleneck, how exactly did I benchmark it and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking. Including: benchmarking with compiler optimizations disabled...

From there on you can start looking at system-specific things as well as the algorithms themselves - there are far too many things to look at to cover in an SO answer. There is a huge difference between optimizing code for a low-end microcontroller and for a 64-bit desktop PC (and everything in between). A minimal timing harness is sketched below.
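As a concrete illustration of point 3, here is a minimal, hypothetical timing harness for the function from the question. The file-scope mass, the data sizes, the repeat count and the volatile sink that keeps the result from being optimized away are all assumptions of this sketch, and it should be built with optimizations enabled (e.g. gcc -O2) or the measurement says little about real performance:

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime; assumes a POSIX system */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double mass = 1.0;          /* the question assumes mass is visible at file scope; 1.0 is a placeholder */

static inline double sqr(double x) { return x * x; }

/* the question's function, repeated here so the sketch is self-contained */
double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
    return 0.5 * mass * sum;
}

int main(void)
{
    const int n = 100000;                          /* arbitrary problem size */
    const int reps = 1000;                         /* repeat to get a measurable duration */
    double *v = malloc(3 * (size_t)n * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < 3 * n; i++)
        v[i] = (double)rand() / RAND_MAX;          /* realistic, nonzero input data */

    volatile double sink = 0.0;                    /* keeps the result live so the calls are not optimized away */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        sink += get_kinetic_energy(v, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d calls took %.6f s (sink=%g)\n", reps, secs, (double)sink);
    free(v);
    return 0;
}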

One thing that looks a bit like premature optimization, but could just be ignorance of language abilities, is that you have all of the information that describes a particle flattened into an array of double values.

I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three datapoints on each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.

This will be much easier for you than having to pass three times the number of particles as arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors. A possible shape for this is sketched below.
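A minimal sketch of that idea, using hypothetical names (Particle, kinetic_energy), might look like this:

typedef struct {
    double vx, vy, vz;   /* velocity components of one particle */
} Particle;

/* total kinetic energy of n particles of equal mass */
double kinetic_energy(const Particle p[], int n, double mass)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += p[i].vx * p[i].vx + p[i].vy * p[i].vy + p[i].vz * p[i].vz;
    return 0.5 * mass * sum;
}

Whether this array-of-structs layout runs faster or slower than the flat array depends on access patterns and on how well the compiler vectorizes it, so the wall clock still has the final say.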

Looking at how both gcc and clang handle your code, the micro-optimisation you contemplate is in vain. The compilers already apply standard common subexpression elimination techniques that remove the overhead you are trying to eliminate.

As a matter of fact, the code generated handles 2 components at a time using XMM registers.

If performance is a must, then here are steps that will save the day:

  • the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.

  • if you have a profiler, use it to determine where the bottlenecks are, if any. Changing the algorithms for those parts that appear to hog performance is an effective approach.

  • try and get the best from the compiler: study the optimization options and try and let the compiler use more aggressive techniques if they are appropriate for the target system. For example, -mavx512f -mavx512cd let gcc generate code that handles 8 components at a time using the 512-bit ZMM registers (example build lines follow this list).

    This is a non-intrusive technique as the code does not change, so you don't risk introducing new bugs by hand-optimizing the code.
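For instance, the build lines below show how one might hand gcc those options; the file name is hypothetical, and which flags actually pay off depends on the target CPU, so measure rather than assume:

gcc -O3 energy.c -o energy                        # baseline with aggressive standard optimizations
gcc -O3 -mavx512f -mavx512cd energy.c -o energy   # allow 512-bit ZMM code, as mentioned above
gcc -O3 -march=native energy.c -o energy          # use all instruction set features of the build machine's CPU

Since only the build command changes, the source stays untouched, which is exactly the non-intrusive property mentioned above.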

Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.

Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):

double get_kinetic_energy(const double v[], int n, double mass)
{
    double sum = 0.0;

    for (int i = 0; i < 3 * n; i++)
        sum += v[i] * v[i];

    return 0.5 * mass * sum;
}

Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.

They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.

Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.

For example, given

void test(char *p)
{
    char *e = p+5;
    do
    {
        p[0] = p[1];
        p++;
    }while(p < e);
}

when targeting the Cortex-M0 with an optimization level below -O2, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 7). While it would be theoretically possible that a library implementation of memmove might optimize the n==7 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it would seem far more likely that any plausible implementation would spend some time analyzing what needs to be done, and then, after doing that, spend just as long executing the loop as code generated using -O0 would.

What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.
