Transposing a global 2D square matrix/array of 1 GB with a tiled (cache-aware) approach shows no performance gain in single-threaded execution over the normal transpose method. I am not discussing speeding up the transpose with AVX/SSE (SIMD) or with cache-oblivious transpose algorithms (http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf).
#include <stdio.h>
#include <sys/time.h>

#define SIZE 16384

float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
    int i, j;
    b[0][9999] = 1.0;  /* touch b so the compiler cannot elide the work */
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            a[i][j] = b[j][i];
}

void testTiledTranspose() {
    int i, j;
    b[0][9999] = 1.0;
    int blocksize = 16;
    for (i = 0; i < SIZE; i += blocksize) {
        for (j = 0; j < SIZE; j += blocksize) {
            for (int ii = i; ii < i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }
            }
        }
    }
}

int main()
{
    struct timeval t1, t2;
    /*
    gettimeofday(&t1, NULL);
    testNormalTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Normal transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    */
    gettimeofday(&t1, NULL);
    testTiledTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Tiled transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    printf("%f\n", a[9999][0]);
    return 0;
}
Loop tiling helps when data is reused: if you use an element SIZE times, you are better off using it all SIZE times before proceeding to the next element.

Unfortunately, a 2D transpose reuses no element of either matrix a or b. Even worse, because the loop mixes row and column access (i.e. a[i][j] = b[j][i]), you can never get unit-stride memory access on both the a and b arrays at the same time, only on one of them.
So in this case tiling is not that efficient, but you can still see some performance improvement even with this "random" (strided) access, provided the memory footprint of those accesses fits into the cache of your system. Basically, this means you have to choose blocksize carefully: the 16 chosen in the example might work better on one system and worse on another.
Here are the results from my computer for different power-of-2 block sizes at SIZE = 4096:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 32052765 ns 32051761 ns 21
tiled_transpose_2d/2 22246701 ns 22245867 ns 31
tiled_transpose_2d/4 16912984 ns 16912487 ns 41
tiled_transpose_2d/8 16284471 ns 16283974 ns 43
tiled_transpose_2d/16 16604652 ns 16604149 ns 42
tiled_transpose_2d/32 23661431 ns 23660226 ns 29
tiled_transpose_2d/64 32260575 ns 32259564 ns 22
tiled_transpose_2d/128 32107778 ns 32106793 ns 22
fixed_tile_transpose_2d 16735583 ns 16729876 ns 41
As you can see, the version with blocksize = 8 works best for me, almost doubling the performance.
Here are the results for SIZE = 4131 and power-of-3 block sizes:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 29875351 ns 29874381 ns 23
tiled_transpose_2d/3 30077471 ns 30076517 ns 23
tiled_transpose_2d/9 20420423 ns 20419499 ns 35
tiled_transpose_2d/27 13470242 ns 13468992 ns 51
tiled_transpose_2d/81 11318953 ns 11318646 ns 61
tiled_transpose_2d/243 10229250 ns 10228884 ns 65
fixed_tile_transpose_2d 10217339 ns 10217066 ns 67
Regarding the SIZE = 16384 issue: I cannot reproduce it, i.e. I still see the same gain for the big matrix. Just note that 16384 * 16384 * sizeof(float) makes 1 GiB per matrix, so a and b together occupy 2 GiB, which might expose some system issues...