
CUDA - How to return results of unknown size

I'm passing a large 2D array (in C) to the device and determining all possible combinations. For example:

A = 
id  val1 val2
1   100  200
2   400  800

Combination = 
id1  id2  sumval1  sumval2
1    2    500      1000

Because of the size of the original array, storing and returning all possible combinations would not be possible. I would like to return all combinations where sumval1 > 500 and sumval2 > 1000.

How can I return just this subset of combinations to the host to be written to a file; given that I won't know how many combinations meet the conditions?

Some possible approaches:

  1. Allocate (from the host) whatever space you have left in GPU memory as a fixed results buffer. If you exceed that, you weren't going to be able to pass all the combinations back in a single transfer anyway (which may lead you to the paging solution proposed by mtk99).
  2. Dynamically allocate space as you need it on the device using in-kernel malloc. When combination generation completes, gather the individual results into a single buffer created with malloc, and pass that buffer's total size and pointer back to the host. The host then allocates a buffer of that size with cudaMalloc and launches a kernel that copies the data from the malloc'd buffer into the cudaMalloc'd one. Once that copy kernel finishes, the host can transfer the data back from the cudaMalloc'd buffer.

Without knowing anything else about what you are trying to do, I would suggest that 1 is probably the best approach. In-kernel malloc is not particularly fast when making large numbers of small allocations. Also note that in-kernel malloc draws from a device heap whose default size limit is 8 MB; it can be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) before the first kernel launch.

You can page the results:

  • Create a fixed-size result array (say, Z items).

  • Return not only the results but the point where you stopped (last_id1, last_id2).

  • On the next call pass a new starting point (start_id1, start_id2) based on your last result.

You can use CUDA streams to overlap the result transfers with computation and keep the GPU busy between pages.

Based on this, you could even distribute the calculation using several GPUs.
