
C++ loop unrolling for compile time constant small values

I have these 2 functions:

#include <iostream>

template<int N>
void fun()
{
    for(int i = 0; i < N; ++i)
    {
        std::cout<<i<<" ";
    }
}

void gun(int N)
{
    for(int i = 0; i < N; ++i)
    {
        std::cout<<i<<" ";
    }
}

May I assume that in the first version the compiler will optimize (unroll) the loop for every small N (by small I mean N = {1, 2, 3, 4})?

"May I assume that in the first version the compiler will optimize the loop for every small N?"

That is a typical optimization, although "assume" is a strong word. If your code depends on a particular optimization being applied, you will eventually be disappointed, because no optimization is guaranteed.

Your second version may experience the same optimization if the compiler is able to inline the function.

You never have any guarantees as to what the optimization will do, but given a suitable optimization level, you can usually rely on it making better choices than you would if optimizing manually.

If you really want to know what code is produced, you can always take a look at the resulting assembly.
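As a minimal sketch of how one might check this, the two functions from the question can be put side by side in one file and compiled to assembly. The `out` parameter and the `run_fun4` / `run_gun4` helpers are additions for illustration (they make the output checkable), and the command in the comment assumes a GCC toolchain.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// To inspect the generated code, assuming GCC is installed:
//   g++ -O2 -S example.cpp -o example.s
// Then compare the code emitted for run_fun4 and run_gun4 in example.s.

// Loop bound is a template argument: always a compile-time constant.
template<int N>
void fun(std::ostream& out)
{
    for (int i = 0; i < N; ++i)
        out << i << " ";
}

// Loop bound is a run-time argument; it only becomes a known constant
// if the call is inlined and the argument is itself a constant.
void gun(std::ostream& out, int N)
{
    for (int i = 0; i < N; ++i)
        out << i << " ";
}

// Hypothetical helpers: both call sites use the constant 4.
std::string run_fun4()
{
    std::ostringstream s;
    fun<4>(s);
    return s.str();
}

std::string run_gun4()
{
    std::ostringstream s;
    gun(s, 4);
    return s.str();
}
```

Both helpers produce the same text; whether the loops are actually unrolled in either case can only be seen in the assembly, and it may differ between compilers and optimization levels.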

It depends on your optimization level and flags. There's a big difference between -O0 -g (no optimization, debugging enabled), -O3 (aggressively optimize for speed), and -Os (optimize for size).

These days loop unrolling isn't necessarily a win, even when optimizing for speed. Larger code can cause instruction cache misses, whose cost can far outweigh the speedup from unrolling a simple loop. And the cost of the conditional branch in a loop like this is almost negligible, since branch prediction will correctly anticipate all but the last iteration.

If the compiler can inline either of the functions, it will also unroll the loop if it thinks that's the right thing to do. When and how a compiler decides that unrolling is beneficial is quite a complex matter, and depends heavily on other factors, such as the number of available registers and what happens inside the loop. I doubt the example given above would gain much from removing the 5 or so instructions of loop overhead, given that the cout call will probably consume several thousand times as much time. Whether the compiler can figure that out or not is another matter, but it isn't entirely unknown for compilers to have SOME understanding of whether a function is small or not.

On the other hand, if the code looks something like this:

int arr[5];  // Global array; size chosen to match N = 5 below.

template<int N>
int fun()
{
    int sum = 0;
    for(int i = 0; i < N; ++i)
    {
        sum += arr[i];
    }
    return sum;  // The original snippet omitted the return.
}

Then I would expect the compiler to unroll the loop to be something like this:

    int *tmp = arr;
    sum += *tmp++;
    sum += *tmp++;
    sum += *tmp++;
    sum += *tmp++;
    sum += *tmp++;

Assuming N = 5.
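To make the comparison concrete, here is a small sketch that places the loop next to the hand-unrolled form, plus a third variant that expands the additions at the source level with a C++17 fold expression. The fold-expression variant is not from the answer; it is a swapped-in technique shown only as one way to write the expansion explicitly. The names `sum_loop`, `sum_unrolled5`, `sum_expanded` and the sample data are assumptions for illustration.

```cpp
#include <cstddef>
#include <utility>

int arr[5] = {1, 2, 3, 4, 5};  // sample data, assumed for illustration

// The loop as written; the optimizer may or may not unroll it.
template<int N>
int sum_loop()
{
    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += arr[i];
    return sum;
}

// Roughly what the unrolled result would look like for N = 5.
int sum_unrolled5()
{
    int sum = 0;
    const int* tmp = arr;
    sum += *tmp++;
    sum += *tmp++;
    sum += *tmp++;
    sum += *tmp++;
    sum += *tmp++;
    return sum;
}

// C++17: the pack expansion writes out arr[0] + arr[1] + ... explicitly,
// so there is no loop in the source at all.
template<std::size_t... I>
int sum_expanded_impl(std::index_sequence<I...>)
{
    return (0 + ... + arr[I]);
}

template<int N>
int sum_expanded()
{
    return sum_expanded_impl(std::make_index_sequence<N>{});
}
```

All three compute the same sum. Even the fold-expression version only fixes the shape of the source; the machine code is still the compiler's decision.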

This applies to any function whose definition is "visible" to the compiler and where N is known at compile time. So, assuming gun isn't defined in a different source file and is called with a constant argument, I would expect it to be inlined and unrolled exactly the same as fun (which, being a function template, HAS to be visible in this translation unit).

If you want to be a little more explicit, you can use Duff's Device, which uses switch-case fallthrough to unroll a loop by hand. I can't speak to how well it works in practice, though. I would imagine, however, that if you can hint to the compiler to unroll the loop instead, that would be faster.
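For reference, a sketch of both approaches applied to a summing loop. `duff_sum` adapts Duff's Device (originally a copy loop) with an unroll factor of 4, and `hinted_sum` uses the unroll pragmas that Clang and GCC 8 or later provide. The function names, the factor of 4 and the sample data are assumptions for illustration, and no claim is made about which version is faster.

```cpp
#include <cstddef>

const int sample[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};  // assumed sample data

// Duff's Device, adapted to summation. The switch jumps into the middle
// of the do-while body to handle the count % 4 leftover elements first;
// every later pass runs all four additions. Fallthrough is intentional.
int duff_sum(const int* p, std::size_t count)
{
    int sum = 0;
    if (count == 0)
        return sum;
    std::size_t passes = (count + 3) / 4;
    switch (count % 4) {
    case 0: do { sum += *p++;
    case 3:      sum += *p++;
    case 2:      sum += *p++;
    case 1:      sum += *p++;
            } while (--passes > 0);
    }
    return sum;
}

// Plain loop with a compiler-specific unroll hint. The compiler is
// still free to ignore the hint.
int hinted_sum(const int* p, std::size_t count)
{
    int sum = 0;
#if defined(__clang__)
#pragma clang loop unroll_count(4)
#elif defined(__GNUC__) && __GNUC__ >= 8
#pragma GCC unroll 4
#endif
    for (std::size_t i = 0; i < count; ++i)
        sum += p[i];
    return sum;
}
```

For example, `duff_sum(sample, 5)` enters at `case 1`, adds one element, and then makes one full pass of four additions.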

Compilers are also pretty smart, and while they're not infallible, their optimization choices are generally better than our own intuition.
