
OpenCL - improve memory size usage

I'm currently working on a GPU project that gives me slower results than the CPU version. The reason is that I'm enqueuing an input data array that is too small (length = 1024).

I would like to enqueue more data, but I'm stuck because of memory usage: inside my kernel I compute 283 functions, each evaluated over 481 periods.

So in order to get my results back, I had to create an output array of N (here 1024) x 481 x 283 doubles (because each of the 283 functions returns a double).

This is already too large. If I enqueue more data, the output grows by 481 x 283 doubles per extra input element and I will hit the GPU's memory limit. I don't know how to use less memory.

This is an example of my kernel function:

PERIODS = 481
data = the input element for this work-item, i.e. input[get_global_id(0)]
OUTPUT(i, t, x) = accessor that stores the results (a three-dimensional array flattened into one buffer)

for (int t = 0; t < PERIODS; t++)
    OUTPUT(get_global_id(0), t, 1) = function1(data, t);
for (int t = 0; t < PERIODS; t++)
    OUTPUT(get_global_id(0), t, 2) = function2(data, t);
for (int t = 0; t < PERIODS; t++)
    OUTPUT(get_global_id(0), t, 3) = function3(data, t);

Of course it looks bad, but the problem is that my "called" functions sometimes need the value at T = 12 or T = 24. So I have to compute all periods for each function to be sure the values they need are present in the OUTPUT accessor.

For example, in a 2D problem (data, PERIODS), function2 needs the result of function1 at T = 4. But work-items are not all synchronized, so the value function2 needs may or may not be there yet. My solution was to guarantee it by putting for loops around every called function and going from a 2D NDRange down to a 1D one. It looks really bad; a 2D organization would have been great, but I didn't find any way to synchronize all threads through global memory.

The first idea I had to use less memory was to call the kernel 481 times with an argument T = t. The output array would then be 481 times smaller and I could enqueue 481 times more data. But to use this solution I would have to split up my for loops, which I don't think is really possible (as I said, function2 may need function1's result at T = 4, for example).

If you have any ideas or solutions, I would be glad to hear them.

> I'm currently working on a GPU project that gives me slower results than the CPU version. The reason is that I'm enqueuing an input data array that is too small (length = 1024).

Assuming your GPU has asynchronous DMA engines, why don't you implement it as a pipeline? Upload chunk 1; compute chunk 1 while uploading chunk 2; download chunk 1 while computing chunk 2 and uploading chunk 3; then download 2 + compute 3 + upload 4; download 3 + compute 4 + upload 5; and so on. This should hide at least the reads or the writes to the GPU.

> I would like to enqueue more data, but I'm stuck because of memory usage: inside my kernel I compute 283 functions, each evaluated over 481 periods.

> So in order to get my results back, I had to create an output array of N (here 1024) x 481 x 283 doubles (because each of the 283 functions returns a double).

Assuming the recursion of "function" doesn't span gigabytes, you could enqueue each request of the type

OUTPUT(thread_num, period_num, f_num) = function_f_num(data, period_num);

into a list only when you actually need it (instead of computing everything). Then, when the queue reaches about 1M elements, upload it to the GPU. Add fake recursion (using multiply-copied kernel names with a postfix, or some semi-stack structure backing a kernel) to find any elements that need recursion. Instead of recursing on the device, uncomputable elements could be added to a new list, with the recursion handled host-side to complete all sub-lists. This would also give the software an explicit multi-GPU advantage, since you can compute different queues on different GPUs. Or you could simply let the GPU use a host pointer so all buffers stay host-side; have a look at the CL_MEM_USE_HOST_PTR flag.
