
Maximum number of CUDA blocks?

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link ( http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf ) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".

My questions are:

1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.

2) Is it possible to run an algorithm that requires 2^20 / 512 threads?

3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?

4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>> , or is there an advantage to using a dim3 value?

If you can provide any insight into any of these ^^ questions, I'd appreciate it.

  1. How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.

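Read the relevant documentation, or build and run the deviceQuery utility from the CUDA samples. The two figures in the quote apply to different hardware generations and grid dimensions: on devices of compute capability 3.0 and later, the x dimension of the grid can be as large as 2^31 - 1 blocks, while the y and z dimensions (and the x dimension on older devices) are limited to 65,535. In either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.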
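If you would rather check the limits from your own program than from deviceQuery, a minimal sketch along these lines should work. It assumes the GPU of interest is device 0, and `blocksNeeded` is a hypothetical helper introduced only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: ceiling division, i.e. the number of blocks needed
// to cover totalThreads with threadsPerBlock threads in each block.
constexpr unsigned int blocksNeeded(unsigned int totalThreads,
                                    unsigned int threadsPerBlock) {
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    // Assumption: the device of interest is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Limits reported by the runtime for this particular device.
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max grid size: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Blocks required for N^2 = 2^20 threads at two block sizes.
    const unsigned int total = 1u << 20;
    std::printf("blocks at 512 threads/block:  %u\n", blocksNeeded(total, 512));
    std::printf("blocks at 1024 threads/block: %u\n", blocksNeeded(total, 1024));
    return 0;
}
```

Compile it with nvcc and compare the printed grid limits against the block counts on the last two lines.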

  2. Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?

Yes
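As a rough sketch of what such a launch could look like for N = 2^10: the kernel name `pairKernel`, the helper `ceilDiv`, and the placeholder work `i + j` are all made up for illustration, since the question does not say what the algorithm computes. Each of the N^2 threads derives its (i, j) pair from a flat global index.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: ceiling division for sizing the grid.
constexpr int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// One thread per (i, j) pair of the N x N index space.
__global__ void pairKernel(int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flat global index
    if (idx < n * n) {                                // guard the last block
        int i = idx / n;
        int j = idx % n;
        out[idx] = i + j;  // placeholder for the real per-pair work
    }
}

int main() {
    const int n = 1 << 10;                        // N = 2^10
    const int total = n * n;                      // N^2 = 2^20 threads
    const int threads = 512;
    const int blocks = ceilDiv(total, threads);   // 2048 blocks

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, total * sizeof(int)) != cudaSuccess) return 1;

    pairKernel<<<blocks, threads>>>(dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dOut);
        return 1;
    }

    std::vector<int> hOut(total);
    cudaMemcpy(hOut.data(), dOut, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    // The last thread handles (n-1, n-1), so its result should equal 2*(n-1).
    std::printf("out[last] = %d, 2*(n-1) = %d\n", hOut[total - 1], 2 * (n - 1));
    return 0;
}
```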

  3. If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?

Nothing runs. The kernel launch fails and the runtime reports an error.
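The error is not printed for you; you have to ask the runtime for it. Here is a minimal sketch of doing so, where `emptyKernel` and `tryLaunch` are hypothetical names and the 4096-thread block is chosen only because it exceeds the 1024-thread limit mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Hypothetical helper: launches emptyKernel with the given configuration
// and returns whatever status the runtime reports.
cudaError_t tryLaunch(unsigned int blocks, unsigned int threads) {
    emptyKernel<<<blocks, threads>>>();
    cudaError_t err = cudaGetLastError();  // launch configuration errors show up here
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();     // errors during execution show up here
    }
    return err;
}

int main() {
    // A configuration within the limits discussed above.
    cudaError_t ok = tryLaunch(2048, 512);
    std::printf("2048 blocks x 512 threads: %s\n", cudaGetErrorString(ok));

    // 4096 threads per block is over the per-block limit, so no thread runs.
    cudaError_t bad = tryLaunch(1, 4096);
    std::printf("1 block x 4096 threads:    %s\n", cudaGetErrorString(bad));
    return 0;
}
```

If the status is never checked, the program simply carries on with whatever was in the output buffers, which is why checking after every launch is worth the extra lines.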

  3. Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?

No. You would have to explicitly implement such a scheme yourself.
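One common way to implement such a scheme yourself is a grid-stride loop, where a fixed number of threads each walk over several work items. The sketch below is only an illustration: `scaleKernel`, `scaleOnDevice`, and the choice of 64 blocks of 256 threads are assumptions, not anything from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: thread t handles items t, t + stride, t + 2*stride, ...
// so the grid may contain far fewer threads than there are items.
__global__ void scaleKernel(float *data, int count, float factor) {
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical helper: scales v in place on the GPU.
// Returns false if any CUDA call reports an error.
bool scaleOnDevice(std::vector<float> &v, float factor, int blocks, int threads) {
    float *d = nullptr;
    size_t bytes = v.size() * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scaleKernel<<<blocks, threads>>>(d, static_cast<int>(v.size()), factor);
        // The device-to-host copy waits for the kernel on the default stream.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}

int main() {
    std::vector<float> v(1 << 20, 1.0f);  // 2^20 work items
    // 64 * 256 = 16384 threads, so each thread processes 64 items.
    if (!scaleOnDevice(v, 2.0f, 64, 256)) {
        std::fprintf(stderr, "CUDA error while scaling\n");
        return 1;
    }
    std::printf("v[0] = %f, v[last] = %f\n", v.front(), v.back());
    return 0;
}
```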

  4. If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>> , or is there an advantage to using a dim3 value?

There is no difference.
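The reason is that an integer launch argument is converted to a dim3 whose y and z components default to 1. A small host-side sketch of that equivalence follows; `sameShape` is a hypothetical helper written just for this comparison.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when two launch dimensions describe the same shape.
bool sameShape(dim3 a, dim3 b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

int main() {
    dim3 fromInt = 512;            // what <<<number, 512>>> effectively passes
    dim3 explicitDim(512, 1, 1);   // the same shape written out in full

    std::printf("fromInt     = (%u, %u, %u)\n", fromInt.x, fromInt.y, fromInt.z);
    std::printf("explicitDim = (%u, %u, %u)\n",
                explicitDim.x, explicitDim.y, explicitDim.z);
    std::printf("same shape: %s\n", sameShape(fromInt, explicitDim) ? "yes" : "no");

    // A 2D block such as dim3(32, 16) also holds 512 threads; it changes how
    // threadIdx is laid out, not how many threads the block contains.
    dim3 twoD(32, 16);
    std::printf("threads in a 32 x 16 block: %u\n", twoD.x * twoD.y * twoD.z);
    return 0;
}
```

A multi-dimensional dim3 is mainly a convenience when the problem itself is naturally indexed in 2D or 3D.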
