
Maximum number of CUDA blocks?

Original post 2019-04-20 02:30:28 7 1 cuda

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link ( http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf ) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".

My questions are:

1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.

2) Is it possible to run an algorithm that requires 2^20 / 512 threads?

3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?

4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?

If you can provide any insight into any of these ^^ questions, I'd appreciate it.

1 answer

How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.

Read the relevant documentation, or build and run the deviceQuery utility. But in either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.
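As a rough illustration of the query that deviceQuery performs, the sketch below reads the limits through the runtime API and works out the block counts mentioned above. It assumes device 0 is the GPU of interest, and `blocksNeeded` is a hypothetical helper, not part of CUDA. For context on the quoted slide: on devices of compute capability 3.0 and later the x dimension of the grid can go up to 2^31 - 1 while the y and z dimensions stay at 65,535, and older devices capped x at 65,535 as well, which is probably the distinction the slide was drawing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: blocks needed to cover `total` threads with
// `perBlock` threads per block, rounded up.
unsigned int blocksNeeded(unsigned int total, unsigned int perBlock) {
    return (total + perBlock - 1) / perBlock;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Limits reported by the driver for this particular device.
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max grid size: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // N = 2^10, so N^2 = 2^20 threads in total.
    const unsigned int n = 1u << 10;
    const unsigned int total = n * n;
    std::printf("blocks at 512 threads per block:  %u\n", blocksNeeded(total, 512));
    std::printf("blocks at 1024 threads per block: %u\n", blocksNeeded(total, 1024));
    return 0;
}
```

Compiled with nvcc, this prints the limits for whatever hardware it runs on; the two block counts are 2048 and 1024, matching the figures in the answer.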

Is it possible to run an algorithm that requires 2^20 / 512 threads [sic]?

Yes.

If the number of threads [sic] that I need .... is greater than what CUDA can provide, what happens?

Nothing. The kernel does not launch, and a runtime error is emitted.
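A minimal sketch of how that runtime error can be observed, assuming a device with the usual 1024-threads-per-block limit. `emptyKernel` and `tryOversizedLaunch` are hypothetical names used only for illustration. Kernel launches do not return a status themselves, so the error has to be fetched with `cudaGetLastError()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does no work; only the launch itself matters here.
__global__ void emptyKernel() {}

// Requests 2048 threads per block, which is above the 1024-thread limit,
// and returns whatever status the runtime reports for the launch.
cudaError_t tryOversizedLaunch() {
    emptyKernel<<<1, 2048>>>();
    return cudaGetLastError();
}

int main() {
    cudaError_t err = tryOversizedLaunch();
    if (err != cudaSuccess) {
        std::printf("launch rejected: %s\n", cudaGetErrorString(err));
    } else {
        std::printf("launch accepted\n");
    }
    return 0;
}
```

If the error is never checked, the program simply carries on with no kernel having run, which is why this failure is easy to miss.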

Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?

No. You would have to explicitly implement such a scheme yourself.
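One common way to implement such a scheme by hand is a grid-stride loop, where each thread processes several work items instead of exactly one. The sketch below is an illustration under assumptions rather than the answerer's own code: the kernel `addOne`, the helper `runGridStride`, and the 64 x 256 launch shape are all made up for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles items i, i + stride, i + 2*stride, ...
// so a fixed-size grid can cover any number of work items.
__global__ void addOne(int *data, size_t count) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < count; i += stride) {
        data[i] += 1;
    }
}

// Hypothetical driver: zero-fills `count` ints on the device, runs the kernel
// with far fewer threads than items, and checks every item was touched once.
bool runGridStride(size_t count) {
    int *d = nullptr;
    if (cudaMalloc(&d, count * sizeof(int)) != cudaSuccess) return false;
    bool ok = cudaMemset(d, 0, count * sizeof(int)) == cudaSuccess;
    std::vector<int> h(count, -1);
    if (ok) {
        addOne<<<64, 256>>>(d, count);  // 16,384 threads in total
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(h.data(), d, count * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    for (size_t i = 0; ok && i < count; ++i) ok = (h[i] == 1);
    return ok;
}

int main() {
    const size_t count = (size_t)1 << 20;  // N^2 work items for N = 2^10
    std::printf("grid-stride run %s\n", runGridStride(count) ? "succeeded" : "failed");
    return 0;
}
```

With this pattern the launch configuration no longer has to match the problem size, at the cost of each thread doing a loop instead of a single item.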

If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?

There is no difference.
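The reason there is no difference is that the launch parameters are of type `dim3`, and a plain integer is converted to a `dim3` whose y and z components default to 1. The sketch below shows the two spellings side by side; the kernel `writeIndex` and the 2048 x 512 configuration are assumptions picked to match the numbers in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread records its global 1D index.
__global__ void writeIndex(unsigned int *out) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main() {
    const unsigned int numBlocks = 2048, threadsPerBlock = 512;

    // Unspecified dim3 components default to 1.
    dim3 grid(numBlocks);
    dim3 block(threadsPerBlock);
    std::printf("grid = (%u, %u, %u), block = (%u, %u, %u)\n",
                grid.x, grid.y, grid.z, block.x, block.y, block.z);

    unsigned int *d = nullptr;
    size_t bytes = sizeof(unsigned int) * numBlocks * threadsPerBlock;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return 1;

    // These two launches use the same configuration.
    writeIndex<<<numBlocks, threadsPerBlock>>>(d);
    writeIndex<<<grid, block>>>(d);

    cudaError_t err = cudaDeviceSynchronize();
    std::printf("status: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

A `dim3` with more than one non-unit component only changes how threads are indexed (for example, `threadIdx.x` and `threadIdx.y` for an N x N problem), which can make the index arithmetic easier to read.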