简体   繁体   中英

Loop interchange versus Loop tiling

Which of these optimizations is better and in what situation? Why?

Intuitively, I am getting the feeling that loop tiling will in general be a better optimization.

What about for the below example? Assume a cache which can only store about 20 elements in it at any time.

Original Loop:
for(int i = 0; i < 10; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        a[i] += a[i]*b[j];
    }
}

Loop Interchange:
for(int i = 0; i < 1000; i++)
{
    for(int j = 0; j < 10; j++)
    {
        a[j] += a[j]*b[i];
    }
}

Loop Tiling:
for(int k = 0; k < 1000; k += 20)
{
    for(int i = 0; i < 10; i++)
    {
        for(int j = k; j < min(1000, k+20); j++)
        {
            a[i] += a[i]*b[j];
        }
    }
}

The first two cases you are exposing in your question are about the same. Things would really change in the following two cases:

CASE 1:

for(int i = 0; i < 10; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        b[i] += a[i]*a[j];
    }
}

Here you are accessing the matrix "a" as follows: a[0]*a[0], a[0]*a 1 , a[0]*a[2],.... In most architectures, matrix structures are stored in memory like: a[0]*a[0], a 1 *a[0], a[2]*a[0] (first column of first row followed by second column of first raw,....). Imagine your cache only could store 5 elements and your matrix is 6x6. The first "pack" of elements that would be stored in cache would be a[0]*a[0] to a[4]*a[0]. Your first acces would cause no cache miss so a[0][0] is stored in cache but the second yes!! a 0 is not stored in cache! Then the OS would bring to cache the pack of elements a 0 to a 4 . Then you do the third acces: a[0]*a[2] wich is out of cache again. Another cache miss!

As you can colcude, case 1 is not a good solution for the problem. It causes lots of cache misses that we can avoid changing the code for the following:

CASE 2:

for(int i = 0; i < 10; i++)
    {
        for(int j = 0; j < 1000; j++)
        {
            b[i] += a[i]*a[j];
        }
    }

Here, as you can see, we are accessing the matrix as it's stored in memory. Consequently it's much better (faster) than case 1.

About the third code you posted about loop tiling, loop tiling and also loop unrolling are optimizations that in most cases the compiler does automaticaly. Here's a very interesting post in stackoverflow explaining these two techniques;

Hope it helps! (sorry about my english, I'm not a native speaker)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM