
CUDA variables inside global kernel

My questions are:

1) Did I understand correctly that when you declare a variable in a global kernel, there will be a different copy of that variable for each thread? That lets you store an intermediate result in this variable for every thread. Example: vector c = a + b:

__global__ void addKernel(int *c, const int *a, const int *b)
{
   int i = threadIdx.x;
   int p;
   p = a[i] + b[i];
   c[i] = p;
} 

Here we declare the intermediate variable p. But in reality there are N copies of this variable, one for each thread.

2) Is it true that if I declare an array, N copies of that array will be created, one for each thread? And since everything inside the global kernel happens in GPU memory, do you need N times more GPU memory for any declared variable, where N is the number of threads?

3) In my current program I have 35 * 48 = 1680 blocks, and each block includes 32 * 32 = 1024 threads. Does that mean any variable declared within a global kernel will cost me N = 1024 * 1680 = 1,720,320 times more memory than one declared outside the kernel?

4) To use shared memory, I need M times more memory for each variable than usual, where M is the number of blocks. Is that true?

1) Yes. Each thread has a private copy of every non-shared variable declared in the function. These usually go into GPU register memory, though they can spill into local memory.
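As a sketch of that point (the kernel name and sizes below are illustrative, not from the question): both a scalar and a locally declared array are per-thread private storage; the scalar will typically sit in registers, while the array is a candidate for spilling to local memory.

```cuda
// Illustrative sketch: every variable below is replicated per thread.
__global__ void privateCopies(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;   // one copy per thread, usually kept in a register
    float window[8];    // also one copy per thread; may spill to local memory

    for (int k = 0; k < 8; ++k)
        window[k] = in[i * 8 + k];  // each thread fills only its own window
    for (int k = 0; k < 8; ++k)
        acc += window[k];

    out[i] = acc;       // threads never see each other's acc or window
}
```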

2), 3) and 4) While it's true that you need many copies of that private memory, that doesn't mean your GPU has to hold enough private memory for every thread at once. This is because, in hardware, not all threads need to execute simultaneously. For example, if you launch N threads, it may be that half are active at a given time and the other half won't start until there are free resources to run them.

The more resources your threads use, the fewer of them the hardware can run simultaneously, but that doesn't limit how many you can ask to be run: any threads the GPU doesn't have resources for will be run once some resources free up.
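You can ask the runtime how this trade-off works out for a particular kernel. A host-side sketch, assuming the `addKernel` from the question is in scope:

```cuda
// Host-side sketch: query how many blocks of addKernel can be resident
// per multiprocessor, given the registers/shared memory it actually uses.
// More resources per thread -> fewer resident blocks -> less parallelism.
int numBlocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks,
    addKernel,
    1024,   // threads per block, as in the question
    0);     // dynamic shared memory per block, in bytes
// Blocks beyond numBlocks * (number of SMs) simply wait their turn.
```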

This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it is able to run threads in parallel, and to run threads in parallel it needs to keep a lot of threads resident at any given time. In a very general sense, the more resources you use per thread, the fewer threads will be active at a given moment, and the less parallelism the hardware can exploit.
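On question 4 specifically, a minimal sketch (kernel name illustrative): a `__shared__` variable is allocated once per *block* and is visible to all threads of that block, so a launch with M blocks allocates M copies of it in total, not one per thread.

```cuda
// Sketch: `tile` exists once per block, shared by that block's 1024 threads.
// A launch with M blocks therefore allocates M copies of `tile` overall.
__global__ void sharedCopy(int *c, const int *a)
{
    __shared__ int tile[1024];              // one array per block, not per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = a[i];               // each thread fills one slot
    __syncthreads();                        // wait until the whole tile is loaded

    // Threads can now read slots written by other threads in the same block:
    c[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```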
