
CUDA multiple dynamically allocated shared arrays in single kernel

I have the following problem. I am trying to divide a shared array into smaller arrays and then use these arrays in other device functions. In my kernel function I do,

for (int block_x = 0; block_x < blockDim.x; block_x++) {
  for (int block_y = 0; block_y < blockDim.y; block_y++) {
    //set up shared memory block
    extern __shared__ vec3f share[];
    vec3f *sh_pos = share;
    vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
    vec3f *sh_density = &sh_velocity[blockDim.x*blockDim.y];
    vec3f *sh_pressure = &sh_density[blockDim.x*blockDim.y];
    //index by 2d threadIdx's
    unsigned int index = (block_x * blockDim.x + threadIdx.x) + blockDim.x * gridDim.x * (block_y * blockDim.y + threadIdx.y);
    sh_pos[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].position();
    sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].velocity();
    sh_pressure[blockDim.x * threadIdx.x + threadIdx.y].x = oldParticles[index].pressure();
    sh_density[blockDim.x * threadIdx.x + threadIdx.y].x = oldParticles[index].density();
    __syncthreads();
    d_force_pressure(oldParticles[arr_pos],c_kernel_support);
    __syncthreads();
  }
}

As far as I can tell, all the "sh_" arrays get filled with zeros and not the values that I want. I can't tell what I am doing wrong. Note that vec3f is a vector of floats, just like the float3 datatype. Also, I didn't think I could mix in plain floats for density and pressure, so I just made them vectors and am using a single component. Then, for example, my d_force_pressure function is,

__device__ void d_force_pressure(particle& d_particle, float h) {
  extern __shared__ vec3f share[];
  vec3f *sh_pos = share;
  vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
  vec3f *sh_density = &sh_velocity[blockDim.x*blockDim.y];
  vec3f *sh_pressure = &sh_density[blockDim.x*blockDim.y];
  for (int i = 0; i < blockDim.x * blockDim.y; i++) {
    vec3f diffPos = d_particle.position() - sh_pos[i];
    d_particle.force() += GradFuncion(diffPos,h) * -1.0 * c_particle_mass *  (d_particle.pressure()+sh_pressure[i].x)/(2.0*sh_density[i].x);
  }
}

After calls to this function I get NaNs, since I am dividing by zero (sh_density[i].x is, as far as I can tell, 0). Also, is this in general the correct way to load shared memory?

The kernel is called by

dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), (int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), 1);
int sharedMemSize = block.x*block.y*4*sizeof(vec3f);
force_kernel<<< grid,block,sharedMemSize  >>>(particle_ptrs[1],particle_ptrs[0],time_step);

This is kind of a follow-up answer.

As per the comments made by @RobertCrovella, I went on to run cuda-memcheck. Believe it or not, this actually showed no errors. However, when I changed a constant in my code (one that controls the sizing of some arrays), cuda-memcheck showed errors related to the write error question I posted here. This made me re-check the way I was filling the shared arrays. Basically, what needed changing was

for (int block_x = 0; block_x < blockDim.x; block_x++) {
  for (int block_y = 0; block_y < blockDim.y; block_y++) {

to

for (int block_x = 0; block_x < gridDim.x; block_x++) {
  for (int block_y = 0; block_y < gridDim.y; block_y++) {

I believe that this then gives the right position for the index variable. I basically learned that whenever you are using shared memory and notice things running slowly, it's a good idea to use cuda-memcheck.
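For reference, here is a minimal sketch of how the corrected loop bounds combine with the existing index computation (the shared-memory loads and the call to d_force_pressure are elided; they stay as in the kernel shown above):

for (int block_x = 0; block_x < gridDim.x; block_x++) {
  for (int block_y = 0; block_y < gridDim.y; block_y++) {
    //global index of the particle this thread loads for the current tile
    unsigned int index = (block_x * blockDim.x + threadIdx.x) + blockDim.x * gridDim.x * (block_y * blockDim.y + threadIdx.y);
    //... fill sh_pos, sh_velocity, sh_pressure, sh_density from oldParticles[index] as before ...
    __syncthreads();
    //... use the shared tile ...
    __syncthreads();
  }
}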

I indicated in your previous question that you don't want to do this:

dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), (int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), 1);

you want to do this:

dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x))), (int)ceil(sqrt(float(max_particles)) / (float(block.y))), 1);

The x grid dimension should be scaled by the threadblock x dimension, not by the threadblock x dimension times the threadblock y dimension. However, the code I posted in my previous answer also had this error; even though I pointed it out in the comments, I forgot to fix it.
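To make the difference concrete with purely illustrative numbers: suppose max_particles = 4096 and BLOCK_SIZE = 16, so sqrt(max_particles) = 64. The corrected formula gives ceil(64/16) = 4 blocks in each grid dimension, i.e. a 4x4 grid of 16x16 blocks = 4096 threads, one per particle. The original formula gives ceil(64/256) = 1 block in each dimension, i.e. only 256 threads, far too few to cover all the particles.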

Furthermore, this indexing doesn't look right to me:

sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] 

I think it should be:

sh_velocity[blockDim.x * threadIdx.y + threadIdx.x] 

You have several examples of that.
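One way to avoid repeating (and mistyping) that expression is to compute the in-block offset once; a minimal sketch, assuming index is computed as in your kernel:

//row-major offset of this thread within its 2D threadblock
unsigned int tid = blockDim.x * threadIdx.y + threadIdx.x;
sh_pos[tid]        = oldParticles[index].position();
sh_velocity[tid]   = oldParticles[index].velocity();
sh_pressure[tid].x = oldParticles[index].pressure();
sh_density[tid].x  = oldParticles[index].density();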

You haven't posted a complete executable. There may well be more issues than the ones I've pointed out above. If I have to go through all of the vec3f -> float3 conversion work I did for your last question, well, someone else can help you then. If you write a simple reproducer that doesn't depend on a bunch of code I don't have, I can try to help further. More than likely, if you do that, you'll discover the problem yourself.

Have you put CUDA error checking into your code like I suggested in my last answer?
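For reference, a typical error checking pattern looks something like this (a sketch, not necessarily identical to what I posted before; the message strings are just examples):

#include <cstdio>
#include <cstdlib>

#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)

//usage after the launch shown above:
//force_kernel<<< grid,block,sharedMemSize >>>(particle_ptrs[1],particle_ptrs[0],time_step);
//cudaCheckErrors("force_kernel launch failed");
//cudaDeviceSynchronize();
//cudaCheckErrors("force_kernel execution failed");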

You might also want to run your code through cuda-memcheck:

cuda-memcheck ./mycode
