I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link ( http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf ) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".
My questions are:
1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
2) Is it possible to run an algorithm that requires 2^20 / 512 threads?
3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
If you can provide any insight into any of these ^^ questions, I'd appreciate it.
- How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
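Read the relevant documentation, or build and run the deviceQuery utility. The two figures in the quote apply to different cases: devices of compute capability 3.0 and later allow up to 2^31 - 1 blocks in the x-dimension of the grid, while older devices, and the y- and z-dimensions on all devices, are limited to 65,535. In either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.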
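As a rough alternative to running deviceQuery, the same limits can be read at runtime with cudaGetDeviceProperties. The sketch below is only illustrative: the helper names blocksNeeded and printLaunchLimits are made up for this example, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover `total` threads
// (ceiling division, so a partial last block is still counted).
unsigned int blocksNeeded(unsigned int total, unsigned int threadsPerBlock) {
    return (total + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: print the launch limits reported for device `dev`.
// Returns the CUDA status so the caller can tell whether the query worked.
cudaError_t printLaunchLimits(int dev) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) return err;

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return cudaSuccess;
}
```

Calling printLaunchLimits(0) and comparing the output against blocksNeeded(1u << 20, 512), which is 2048, shows whether the planned launch fits on a given device.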
- Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?
Yes
- If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?
Nothing runs. The kernel launch fails and a runtime error is reported.
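The error is not printed automatically; it has to be checked for after the launch. A minimal sketch of such a check follows, assuming an empty kernel named noop and a helper named tryLaunch, both invented for this example.

```cuda
#include <cuda_runtime.h>

// Empty kernel used only to exercise the launch configuration.
__global__ void noop() {}

// Hypothetical helper: launch `noop` with the given configuration and
// return the resulting status instead of ignoring it.
cudaError_t tryLaunch(dim3 grid, dim3 block) {
    cudaGetLastError();                    // clear any earlier launch error
    noop<<<grid, block>>>();
    cudaError_t err = cudaGetLastError();  // reports an invalid configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // reports errors raised while running
    return err;
}
```

On hardware with a 1024 threads-per-block limit, tryLaunch(dim3(1), dim3(2048)) should come back with an error rather than cudaSuccess, and cudaGetErrorString can turn that status into a readable message.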
- Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
No. You would have to explicitly implement such a scheme yourself.
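One common way to implement such a scheme is a grid-stride loop, where each thread handles several elements instead of exactly one. This is a sketch of that general technique, not something taken from the question's algorithm; the kernel scale and the helper scaleOnDevice are hypothetical names, and multiplying every element by a factor stands in for the real per-element work.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: the grid may hold fewer threads than there are
// elements; each thread steps through the array by the total thread count.
__global__ void scale(float *data, size_t n, float factor) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical host helper: copy `host` to the device, scale it with a
// grid of `blocks` x `threads`, and copy the result back.
cudaError_t scaleOnDevice(float *host, size_t n, float factor,
                          int blocks, int threads) {
    float *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scale<<<blocks, threads>>>(dev, n, factor);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)  // this copy also waits for the kernel to finish
        err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

With this pattern the launch configuration can be chosen to suit the device, and the kernel still covers all n elements even when blocks * threads is smaller than n.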
- If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
There is no difference. An integer launch argument is implicitly converted to a dim3 whose y and z dimensions are 1.
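To see this on a real device, a kernel can record the gridDim and blockDim it was launched with. The sketch below compares an integer launch with the explicit dim3 form; recordDims and sameDims are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>

// One thread writes out the grid and block dimensions of the launch.
__global__ void recordDims(unsigned int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = gridDim.x;  out[1] = gridDim.y;  out[2] = gridDim.z;
        out[3] = blockDim.x; out[4] = blockDim.y; out[5] = blockDim.z;
    }
}

// Hypothetical helper: true if <<<blocks, threads>>> and the explicit
// <<<dim3(blocks, 1, 1), dim3(threads, 1, 1)>>> report identical dimensions.
bool sameDims(unsigned int blocks, unsigned int threads) {
    unsigned int *dev = nullptr;
    unsigned int a[6] = {0}, b[6] = {0};
    if (cudaMalloc(&dev, sizeof(a)) != cudaSuccess) return false;

    recordDims<<<blocks, threads>>>(dev);                  // integer form
    cudaMemcpy(a, dev, sizeof(a), cudaMemcpyDeviceToHost);

    recordDims<<<dim3(blocks, 1, 1), dim3(threads, 1, 1)>>>(dev);  // dim3 form
    cudaMemcpy(b, dev, sizeof(b), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    for (int i = 0; i < 6; ++i)
        if (a[i] != b[i]) return false;
    return a[0] == blocks && a[3] == threads;
}
```

A dim3 only becomes necessary when the grid or block should have more than one dimension, for example dim3(32, 16) for a 512-thread block laid out in two dimensions.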