Transposing a global 2D square matrix/array of 1 GB with a tiled (cache-aware) approach shows no performance gain in single-threaded execution over the normal transpose method. I am not discussing speeding up the transpose with AVX/SSE (SIMD) or with cache-oblivious transpose algorithms (http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf).
#include <stdio.h>
#include <sys/time.h>

#define SIZE 16384

float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
    int i, j;
    b[0][9999] = 1.0;  /* touch b so the compiler cannot elide the work */
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            a[i][j] = b[j][i];
}

void testTiledTranspose() {
    int i, j;
    b[0][9999] = 1.0;
    int blocksize = 16;
    for (i = 0; i < SIZE; i += blocksize) {
        for (j = 0; j < SIZE; j += blocksize) {
            for (int ii = i; ii < i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }
            }
        }
    }
}

int main()
{
    struct timeval t1, t2;
    /*
    gettimeofday(&t1, NULL);
    testNormalTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Normal transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    */
    gettimeofday(&t1, NULL);
    testTiledTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Tiled transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    printf("%f\n", a[9999][0]);
    return 0;
}
Loop tiling helps when data is reused: if you use an element SIZE times, you are better off using it all SIZE times before proceeding to the next element.

Unfortunately, a 2D transpose reuses no element of either matrix a or b. Even worse, because the loop mixes row and column access (i.e. a[i][j] = b[j][i]), you can never get unit-stride memory access on both the a and b arrays at the same time, only on one of them.
So in this case tiling is not that efficient, but you can still see some performance improvement even with this "random" (strided) access, provided the memory footprint of those accesses fits into the cache of your system. Basically, this means you have to choose blocksize carefully: the 16 chosen in the example might work better on one system and worse on another.
Here are the results from my computer for different power-of-2 block sizes at SIZE = 4096:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 32052765 ns 32051761 ns 21
tiled_transpose_2d/2 22246701 ns 22245867 ns 31
tiled_transpose_2d/4 16912984 ns 16912487 ns 41
tiled_transpose_2d/8 16284471 ns 16283974 ns 43
tiled_transpose_2d/16 16604652 ns 16604149 ns 42
tiled_transpose_2d/32 23661431 ns 23660226 ns 29
tiled_transpose_2d/64 32260575 ns 32259564 ns 22
tiled_transpose_2d/128 32107778 ns 32106793 ns 22
fixed_tile_transpose_2d 16735583 ns 16729876 ns 41
As you can see, the version with blocksize = 8 works best for me, almost doubling the performance.
Here are the results for SIZE = 4131 and power-of-3 block sizes:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 29875351 ns 29874381 ns 23
tiled_transpose_2d/3 30077471 ns 30076517 ns 23
tiled_transpose_2d/9 20420423 ns 20419499 ns 35
tiled_transpose_2d/27 13470242 ns 13468992 ns 51
tiled_transpose_2d/81 11318953 ns 11318646 ns 61
tiled_transpose_2d/243 10229250 ns 10228884 ns 65
fixed_tile_transpose_2d 10217339 ns 10217066 ns 67
Regarding the SIZE = 16384 issue: I cannot reproduce it, i.e. I still see the same gain for the big matrix. Just note that 16384 * 16384 * sizeof(float) makes 1 GiB per matrix, so a and b together occupy 2 GiB, which might expose some system issues...