简体   繁体   中英

What optimizations should be left for compiler?

Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:

v[i*3+0] , v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i and we want to calculate the total k.netic energy. Given that all particles are of the same mass, one may write:

inline double sqr(double x)
{
    return x*x;
}

double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;

    for (int i=0; i < n; i++)
        sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);

    return 0.5 * mass * sum;
}

To reduce the number of multiplications, it can be written as:

double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;

    for (int i=0; i < n; i++)
    {
        double *w = v + i*3;
        sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
    }

    return 0.5 * mass * sum;
}

(one may write a function with even fewer multiplications, but that's not the point of this question)

Now my question is: Since many C compilers can do this kind of optimizations automatically, where should the developer rely on the compiler and where should she/he try to do some optimization manually?

where should the developer rely on the compiler and where should she/he try to do some optimization manually?

  1. Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If no, forget about manual optimizations.

  2. Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.

  3. When I found this bottleneck, how exactly did I benchmark it and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking. Including: benchmarking with compiler optimizations disabled...

From there on you can start looking at system-specific things as well as the algorithms themselves - there's far too many things to look at to cover in an SO answer. It's a huge difference between optimizing code for a low-end microcontroller and a 64-bit desktop PC (and everything in between).

One thing that looks a bit like premature optimization, but could just be ignorance of language abilities is that you have all of the information to describe particles flattened into an array of double values.

I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three datapoints on each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.

This will be much easier for you than having to pass three times the number of particles arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors.

Looking at how both gcc and clang handle your code, the micro optimisation you contemplate is vain. The compilers already apply standard common subexpression elimination techniques that remove to overhead you are trying to eliminate.

As a matter of fact, the code generated handles 2 components at a time using XMM registers.

If performance is a must, then here are steps that will save the day:

  • the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.

  • if you have a profiler, use it to determine where are the bottlenecks if any. Changing algorithms for those parts that appear to hog performance is an effective approach.

  • try and get the best from the compiler: study the optimization options and try and let the compiler use more aggressive techniques if they are appropriate for the target system. For example -mavx512f -mavx512cd let the gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.

    This is a non intrusive technique as the code does not change, so you don't risk introducing new bugs by hand optimizing the code.

Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.

Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):

double get_kinetic_energy(const double v[], int n, double mass)
{
    double sum = 0.0;

    for (int i = 0; i < 3 * n; i++)
        sum += v[i] * v[i];

    return 0.5 * mass * sum;
}

Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.

They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.

Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.

For example, given

void test(char *p)
{
    char *e = p+5;
    do
    {
        p[0] = p[1];
        p++;
    }while(p < e);
}

when targeting the Cortex-M0 with an optimization level below -O2, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 7); . While it would be theoretically possible that a library implementation of memmove might optimize the n==7 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it would seem far more likely that any plausible implementations would spend some time analyzing what needs to be done, and then after doing that spend just as long executing the loop as would code generated using -O0.

What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM