Maximum number of CUDA blocks?

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link ( http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf ) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".

My questions are:

1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.

2) Is it possible to run an algorithm that requires 2^20 / 512 threads?

3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?

4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?

If you can provide any insight into any of these ^^ questions, I'd appreciate it.

  1. How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.

Read the relevant documentation, or build and run the deviceQuery utility. But in either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.
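If building deviceQuery is inconvenient, the same limits can be read at run time with `cudaGetDeviceProperties`. The following is only a minimal sketch: device index 0 and the helper name `blocksNeeded` are assumptions made for illustration. As far as I know, the two numbers in the quoted slide correspond to different hardware generations: devices of compute capability 3.0 and later allow up to 2^31 - 1 blocks in the x dimension of the grid (65,535 in y and z), while older devices were limited to 65,535 per dimension.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: ceiling division, so that blocks * perBlock >= total.
unsigned int blocksNeeded(unsigned int total, unsigned int perBlock) {
    return (total + perBlock - 1) / perBlock;
}

int main() {
    cudaDeviceProp prop;
    // Device index 0 is an assumption; a multi-GPU system may want another.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("compute capability:    %d.%d\n", prop.major, prop.minor);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max grid size:         %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // N^2 threads for N = 2^10, as in the question.
    const unsigned int total = 1u << 20;
    std::printf("blocks at 512 threads/block: %u\n", blocksNeeded(total, 512));
    std::printf("blocks at %d threads/block: %u\n", prop.maxThreadsPerBlock,
                blocksNeeded(total, (unsigned int)prop.maxThreadsPerBlock));
    return 0;
}
```

Compiled with something like `nvcc limits.cu -o limits`, this prints the limits for the installed device next to the block counts the question's problem size would need.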

  2. Is it possible to run an algorithm that requires 2^20 / 512 threads [sic]?

Yes

  3. If the number of threads [sic] that I need ... is greater than what CUDA can provide, what happens?

Nothing. The kernel does not run, and a runtime error is emitted.
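The error is only visible if you ask for it, because a kernel launch itself returns nothing. Here is a small sketch of how the failure could be observed; the kernel `noop` and the helper `tryLaunch` are made-up names for illustration, and the exact error string depends on the CUDA version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: only the launch configuration matters here.
__global__ void noop() {}

// Hypothetical helper: launch `noop` with the given shape and report the status.
cudaError_t tryLaunch(unsigned int blocks, unsigned int threadsPerBlock) {
    noop<<<blocks, threadsPerBlock>>>();
    cudaError_t err = cudaGetLastError();  // reports invalid launch configurations
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();     // reports errors raised while running
    }
    return err;
}

int main() {
    // 2048 threads per block exceeds the 1024-thread limit, so this launch
    // is expected to be rejected and the kernel never runs.
    cudaError_t bad = tryLaunch(1, 2048);
    std::printf("1 block x 2048 threads:   %s\n", cudaGetErrorString(bad));

    // The shape from the question: 2048 blocks of 512 threads.
    cudaError_t good = tryLaunch(2048, 512);
    std::printf("2048 blocks x 512 threads: %s\n", cudaGetErrorString(good));
    return 0;
}
```

Without the `cudaGetLastError` call the oversized launch would fail silently, which is why checking after every launch is a common habit.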

  3. Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?

No. You would have to explicitly implement such a scheme yourself.
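One common way to implement such a scheme is a grid-stride loop, in which each thread processes several elements instead of exactly one. This is a sketch of that general technique, not something taken from the question's algorithm: the kernel `doubleAll`, the helper `runDoubleAll`, and the launch shape of 64 blocks of 256 threads are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: every thread starts at its global index and then jumps
// ahead by the total number of launched threads until it passes n.
__global__ void doubleAll(float* data, size_t n) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += stride) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
long runDoubleAll(size_t n, unsigned int blocks, unsigned int threads) {
    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;

    cudaError_t err = cudaMemcpy(dev, host.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        doubleAll<<<blocks, threads>>>(dev, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // This copy also waits for the kernel to finish.
        err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    long wrong = 0;
    for (float v : host) {
        if (v != 2.0f) ++wrong;
    }
    return wrong;
}

int main() {
    // 2^20 elements handled by only 64 * 256 = 16384 threads.
    long wrong = runDoubleAll((size_t)1 << 20, 64, 256);
    std::printf("mismatches: %ld\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

With this pattern the grid size no longer has to match the problem size, so the same kernel works whether or not N^2 threads fit in a single launch.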

  4. If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?

There is no difference.
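The reason is that the launch parameters are `dim3` values either way: a plain integer is implicitly converted to a `dim3` whose y and z components are 1. A short sketch that checks this on the device follows; `recordShape` and `shapeOf` are hypothetical names, and the 2048 x 512 shape is taken from the numbers in the question.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// One thread writes the grid and block shape the kernel was launched with.
__global__ void recordShape(unsigned int* out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = gridDim.x;  out[1] = gridDim.y;  out[2] = gridDim.z;
        out[3] = blockDim.x; out[4] = blockDim.y; out[5] = blockDim.z;
    }
}

// Hypothetical helper: launch with the given shapes and copy back what the
// kernel saw. The parameters are dim3, exactly like the <<<...>>> arguments,
// so integers passed here go through the same implicit conversion.
cudaError_t shapeOf(dim3 grid, dim3 block, unsigned int shape[6]) {
    unsigned int* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, 6 * sizeof(unsigned int));
    if (err != cudaSuccess) return err;

    recordShape<<<grid, block>>>(dev);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(shape, dev, 6 * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}

int main() {
    unsigned int a[6] = {0}, b[6] = {0};
    cudaError_t e1 = shapeOf(2048, 512, a);                          // plain integers
    cudaError_t e2 = shapeOf(dim3(2048, 1, 1), dim3(512, 1, 1), b);  // explicit dim3
    if (e1 != cudaSuccess || e2 != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n",
                     cudaGetErrorString(e1 != cudaSuccess ? e1 : e2));
        return 1;
    }
    bool same = std::memcmp(a, b, sizeof(a)) == 0;
    std::printf("shapes identical: %s\n", same ? "yes" : "no");
    return same ? 0 : 1;
}
```

A `dim3` with more than one non-unit component only changes how threads are indexed inside the kernel, which can be convenient for 2D problems but is a matter of style rather than capacity.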
