
Improving cache performance while iterating through a simple 2D array?

I have been trying to think of a way to rewrite the code below to improve cache performance (by reducing cache misses) when accessing the array.

I am aware that the array is stored in memory row by row (sequentially): ary[0][0], ary[0][1], ary[0][2], ..., ary[1][0], ary[1][1], ary[1][2], ..., ary[49][0], ary[49][1], ..., ary[49][49]. However, I am not sure how to use this information to modify the loop so that cache performance improves.

for (c = 0; c < 50; c++)
    for (d = 0; d < 50; d++)
        ary[d][c] = ary[d][c] + 1;

If you want to access all the cells of a row before moving on to the next row, just swap the two loops:

for (d = 0; d < 50; d++)
    for (c = 0; c < 50; c++)
        ary[d][c] = ary[d][c] + 1;

Or even

for (d = 0; d < 50; d++) {
    int[] array = ary[d];  // fetch the row reference once per row
    for (c = 0; c < 50; c++)
        array[c] = array[c] + 1;
}

But I doubt it has any significant impact, or even any impact at all, especially on such a tiny array: 50 x 50 four-byte ints occupy about 10 KB, which typically fits in an L1 data cache. Make your code simple and readable. Don't optimize prematurely.

Swap the loop order. You're accessing ary[1][0] right after ary[0][0]. ary[1][0] is a whole row (50 elements) farther away, while ary[0][1] is at the very next address.
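To make the distance concrete, here is a minimal C sketch that measures how far apart those elements are. It assumes a contiguous, row-major `int ary[50][50]` as described in the question; the function names are hypothetical and exist only for illustration.

```c
#include <stddef.h>

enum { ROWS = 50, COLS = 50 };

/* Assumed layout: one contiguous, row-major block, as in the question. */
static int ary[ROWS][COLS];

/* Byte distance from ary[0][0] to its neighbour in the same row. */
ptrdiff_t bytes_to_next_column(void)
{
    return (const char *)&ary[0][1] - (const char *)&ary[0][0];
}

/* Byte distance from ary[0][0] to the element directly below it. */
ptrdiff_t bytes_to_next_row(void)
{
    return (const char *)&ary[1][0] - (const char *)&ary[0][0];
}
```

The first function returns `sizeof(int)`, the second `COLS * sizeof(int)`, so with 4-byte ints the column-first loop jumps 200 bytes on every step instead of 4.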

You want to minimize the number of cache misses to improve performance. Each cache miss results in a memory access that loads a new block (a cache line) into the cache. This block contains not just the value you need but also adjacent values from memory. You need to exploit the principle of locality, i.e., use as many values from each memory access as you can. As you mentioned in your observation, the array is stored row by row in memory, so traversing it sequentially minimizes the number of cache misses. Getting back to your code, either swap the loop order:

for (d = 0; d < 50; d++)
    for (c = 0; c < 50; c++)
        ary[d][c] = ary[d][c] + 1;

or swap the indices in the calculation:

for (c = 0; c < 50; c++)
    for (d = 0; d < 50; d++)
        ary[c][d] = ary[c][d] + 1;
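As a rough illustration of why the traversal order matters, the following C sketch counts how often two consecutive accesses land on different cache lines. It is a toy model, not a cache simulator: the 64-byte line size, the 4-byte element size, and a line-aligned array start are all assumptions, and the function names are hypothetical.

```c
#include <stddef.h>

/* Assumptions: 50x50 array, 4-byte elements, 64-byte cache lines,
   array starting on a line boundary. */
enum { N = 50, ELEM_BYTES = 4, LINE_BYTES = 64 };

/* Index of the cache line that holds element [row][col]. */
static size_t line_of(size_t row, size_t col)
{
    return ((row * N + col) * ELEM_BYTES) / LINE_BYTES;
}

/* Counts accesses that fall on a different line than the previous access.
   column_first != 0 models the original loop (ary[d][c] with d innermost);
   column_first == 0 models the swapped, row-by-row loop. */
size_t line_changes(int column_first)
{
    size_t changes = 0;
    size_t last = (size_t)-1;  /* no line touched yet */

    for (size_t outer = 0; outer < N; outer++) {
        for (size_t inner = 0; inner < N; inner++) {
            size_t line = column_first ? line_of(inner, outer)
                                       : line_of(outer, inner);
            if (line != last) {
                changes++;
                last = line;
            }
        }
    }
    return changes;
}
```

Under these assumptions `line_changes(0)` returns 157 and `line_changes(1)` returns 2500. That number is only an upper bound on misses: a real cache that can hold the whole 10 KB array would keep the lines resident, so both orders would end up with roughly the same number of misses. The gap shows up once the array no longer fits in the cache.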

You can even treat the 2D array as a 1D array of 50*50 size and just use a single for loop to scan it from the beginning to the end.
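A minimal sketch of the single-loop variant in C, assuming the storage is declared as one flat block of 50*50 ints; `grid`, `index_of`, and `increment_all` are hypothetical names chosen for this example.

```c
#include <stddef.h>

enum { ROWS = 50, COLS = 50 };

/* The 2D data stored as one flat, row-major block (an assumption of this sketch). */
static int grid[ROWS * COLS];

/* Maps a (row, col) pair to its position in the flat block. */
size_t index_of(size_t row, size_t col)
{
    return row * COLS + col;
}

/* Adds 1 to every element, walking memory from the first element to the last. */
void increment_all(int *flat, size_t count)
{
    for (size_t i = 0; i < count; i++)
        flat[i] += 1;
}
```

Calling `increment_all(grid, ROWS * COLS)` touches the elements in exactly the order they sit in memory, and `grid[index_of(r, c)]` still gives access by row and column where needed.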

You probably don't need to do anything other than swapping the loops, because caches are designed to exploit locality of reference on their own: the cache loads the first element together with the few elements that follow it (spatial locality) and keeps them around for a while (temporal locality).

However, some compilers let you influence caching. For example, GCC provides __builtin_prefetch, which lets you specify which data should be prefetched and whether it should be kept in the cache afterwards.

Built-in Function: void __builtin_prefetch (const void *addr, rw, locality)

This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions are generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.

And the manual gives this example:

for (i = 0; i < n; i++)
{
  a[i] = a[i] + b[i];
  __builtin_prefetch (&a[i+j], 1, 1);
  __builtin_prefetch (&b[i+j], 0, 1);
  /* ... */
}
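In the manual's example, `j` is the look-ahead distance. The second argument is 1 when the data will be written and 0 (the default) when it will only be read; the third is a locality hint from 0 (no need to keep the data cached after use) to 3 (keep it in all cache levels, the default). Below is a self-contained sketch of the same loop for GCC or Clang. The function name `add_arrays` and the `PREFETCH_AHEAD` value are hypothetical; a useful distance would have to be found by measuring on the target machine.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance in elements; needs tuning per machine. */
enum { PREFETCH_AHEAD = 16 };

/* Computes a[i] += b[i] while hinting that later elements will be needed soon. */
void add_arrays(int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Only form addresses that lie inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 1);  /* will be written */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 1);  /* read only */
        }
        a[i] = a[i] + b[i];
    }
}
```

For a simple sequential scan like this, hardware prefetchers usually do the job already, so the explicit hints may not help and are worth keeping only if a benchmark shows a gain.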

