
CUDA shared memory programming is not working

All:

I am learning how shared memory accelerates GPU programs. I am using the code below to calculate, for each element, its squared value plus the square of the average of its left and right neighbors. The code runs, but the result is not as expected.

The first 10 results printed out are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.

The code is as follows, with reference to here:

Would you please tell me which part is wrong? Your help is very much appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>

const int N=1024;

__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[N];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp*tmp + myblock[tid]*myblock[tid];

    // write the result back to global memory
    data[tid] = myblock[tid];
    __syncthreads();
}

int main(){

    char key;

    float *a;
    float *dev_a;

    a = (float*)malloc(N*sizeof(float));
    cudaMalloc((void**)&dev_a, N*sizeof(float));

    for (int i=0; i<N; i++){
        a[i] = i;
    }

    cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);

    compute_it<<<N,1>>>(dev_a);

    cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i=0; i<10; i++){
        std::cout<<a[i]<<",";
    }

    std::cin>>key;

    free(a);
    free(dev_a);

    return 0;
}

One of the most immediate problems in your kernel code is this:

data[tid] = myblock[tid];

I think you probably meant this:

data[tid] = tmp;

In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU, and it means that your tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock).

There are many problems with this approach, but one immediate problem will be encountered here:

tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;

Since tid is always zero, no other values in your shared memory array (myblock) ever get populated, so the logic in this line cannot be sensible. When tid is zero, you are selecting myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] never gets populated with anything.

It seems that you don't understand the various CUDA hierarchies:

  • a grid is all the threads associated with a kernel launch
  • a grid is composed of threadblocks
  • each threadblock is a group of threads working together on a single SM
  • the shared memory resource is a per-SM resource, not a device-wide resource
  • __syncthreads() also operates on a threadblock basis (not device-wide)
  • threadIdx.x is a built-in variable that provides a unique thread ID for all threads within a threadblock, but not globally across the grid (see the sketch below)
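
To illustrate the last point, the usual way to form an index that is unique across the whole grid is to combine the block index with the thread index. A minimal sketch (the kernel name touch_all and the doubling operation are illustrations only, not part of the original code):

__global__ void touch_all(float *data, int n)
{
    // unique across the whole grid, not just within one block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)              // guard in case the grid overshoots n
        data[gid] = 2.0f * data[gid];
}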

Instead, you should break your problem into groups of reasonably-sized threadblocks (i.e. more than one thread per block). Each threadblock will then be able to behave roughly in the fashion you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock.
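
Here is a sketch of that approach (the kernel name compute_it_tiled, the BLOCK constant, and the separate in/out arrays are my assumptions, not code from the original post). Each block loads its tile of BLOCK elements plus one halo element on each side into shared memory, and writes to a separate output array: blocks cannot synchronize with each other, so an in-place update would let one block read a neighbor element that another block has already overwritten. The sketch assumes n is a multiple of BLOCK and treats the data as a wrapped ring, as the original kernel does.

const int BLOCK = 256;    // threads per block; the kernel assumes blockDim.x == BLOCK

__global__ void compute_it_tiled(const float *in, float *out, int n)
{
    // one tile of BLOCK elements plus a one-element halo on each side
    __shared__ float tile[BLOCK + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    int lid = threadIdx.x + 1;                         // local index, shifted past the left halo

    // each thread loads its own element into the tile
    tile[lid] = in[gid];

    // the edge threads of the block also load the halo elements,
    // wrapping around at the ends of the array
    if (threadIdx.x == 0)
        tile[0] = in[(gid == 0) ? (n - 1) : (gid - 1)];
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK + 1] = in[(gid == n - 1) ? 0 : (gid + 1)];

    __syncthreads();   // make the whole tile visible to every thread in the block

    // same arithmetic as the original kernel, but on the local tile
    float tmp = (tile[lid - 1] + tile[lid + 1]) * 0.5f;
    out[gid] = tmp * tmp + tile[lid] * tile[lid];
}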

You're also not doing proper CUDA error checking, which is recommended, especially any time you're having trouble with a CUDA code.
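
One common pattern for that checking looks like the following (the cudaCheck macro name is my own; cudaError_t, cudaSuccess, cudaGetErrorString and cudaGetLastError are the actual CUDA runtime APIs being used):

#define cudaCheck(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// wrap each runtime call, and query the last error after a kernel launch:
cudaCheck(cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice));
compute_it<<<1,N>>>(dev_a);
cudaCheck(cudaGetLastError());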

If you make the first change I indicated in your kernel code, and reverse the order of your block and grid kernel launch parameters:

compute_it<<<1,N>>>(dev_a);

As indicated by Kristof, you will get something that comes close to what you want, I think. However, you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
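
For completeness, the multi-block sketch above scales past 1024 elements with a launch along these lines (again assuming the hypothetical compute_it_tiled kernel, device arrays dev_in and dev_out, and N a multiple of BLOCK); since shared memory is a per-block resource here, each block only needs BLOCK + 2 floats of it, regardless of how large N grows:

compute_it_tiled<<<N/BLOCK, BLOCK>>>(dev_in, dev_out, N);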

This line of code is also not correct:

free (dev_a);

Since dev_a was allocated on the device using cudaMalloc, you should free it like this:

cudaFree (dev_a);

Since you have only one thread per block, your tid will always be 0.

Try launching the kernel this way: compute_it<<<1,N>>>(dev_a);

instead of compute_it<<<N,1>>>(dev_a);
