
what is the optimal global work size (OpenCL)?

I have found some questions on this issue on Stack Overflow, but I still want to ask, in case experts have found new explanations for it.

I tested and found that a basic 1D kernel:

// global size = {1024*1024, 1, 1}
// local size  = {32, 1, 1}
__kernel void add_1d(__global const float *y, __global const float *z,
                     __global float *x)
{
    int i = get_global_id(0);
    x[i] = y[i] + z[i];
}

is much slower than the equivalent 2D kernel:

__kernel void add_2d(__global const float *y, __global const float *z,
                     __global float *x, int width)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int index = i + j * width;
    x[index] = y[index] + z[index];
}
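For context, here is a minimal host-side sketch of how such 1D and 2D launches are typically enqueued. The `queue` and `kernel` objects are assumed to be created elsewhere, and the 2D split into 1024×1024 with a {32, 1} work-group is an assumption, not something stated in the question:

#include <CL/cl.h>

/* Enqueue the 1D configuration: 1024*1024 work-items, work-groups of 32. */
void launch_1d(cl_command_queue queue, cl_kernel kernel)
{
    size_t global[1] = { 1024 * 1024 };
    size_t local[1]  = { 32 };
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local, 0, NULL, NULL);
}

/* Enqueue a 2D configuration with the same total number of work-items,
   split across two dimensions (the exact split is an assumption). */
void launch_2d(cl_command_queue queue, cl_kernel kernel)
{
    size_t global[2] = { 1024, 1024 };
    size_t local[2]  = { 32, 1 };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}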

Can anyone explain this? Thanks.

Because you are accessing int-sized (4-byte) data, in the 1D case nearby work-items will be accessing the same cache lines; for example, with a 64-byte cache line, every run of 16 adjacent work-items touches the same line. So when they miss in the cache, a whole bunch of work-items will be waiting on the same line to be brought in. But in the 2D case, if your graphics device walks the work-item dispatches in the y-major direction, you'll be spreading out the accesses. So you have more parallel cache fills going on, and by the time the dispatch wraps around in the y dimension, the cache lines are already loaded, so they don't stall anymore.

You can verify this by flipping the array indexing. Also, if you increase the array size enough that it can't fit in the cache regardless of tiling, there shouldn't be any difference between the dimensionalities you use.
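For illustration, a sketch of what "flipping the indexing" could look like; the `height` parameter here is hypothetical and would be passed alongside the same buffers:

__kernel void add_2d_flipped(__global const float *y, __global const float *z,
                             __global float *x, int height)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    /* swapped roles: consecutive work-items in dimension 0 now stride by
       height elements instead of touching adjacent addresses */
    int index = j + i * height;
    x[index] = y[index] + z[index];
}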
