使用2D块网格减少CUB

Question

I'm trying to make a sum using the CUB reduction method. 我正在尝试使用CUB减少方法求和。

The big problem is: I'm not sure how to return the values of each block to the Host when using 2-dimensional grids. 最大的问题是：使用二维网格时，我不确定如何将每个块的值返回给主机。

#include <iostream>
#include <math.h>
#include <cub/block/block_reduce.cuh>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <iomanip>

#define nat 1024
#define BLOCK_SIZE 32
#define GRID_SIZE 32

struct frame
{
   int  natm;
   char  title[100];
   float conf[nat][3];
};

using namespace std;
using namespace cub;

__global__
void add(frame* s, float L, float rc, float* blocksum)
{
int i = blockDim.x*blockIdx.x + threadIdx.x;
int j = blockDim.y*blockIdx.y + threadIdx.y;

float E=0.0, rij, dx, dy, dz;

// Your calculations first so that each thread holds its result
  dx = fabs(s->conf[j][0] - s->conf[i][0]);
  dy = fabs(s->conf[j][1] - s->conf[i][1]);
  dz = fabs(s->conf[j][2] - s->conf[i][2]);
  dx = dx - round(dx/L)*L;
  dy = dy - round(dy/L)*L;
  dz = dz - round(dz/L)*L;

   rij = sqrt(dx*dx + dy*dy + dz*dz);

  if ((rij <= rc) && (rij > 0.0))
    {E =  (4*((1/pow(rij,12))-(1/pow(rij,6))));}

//  E = 1.0;
__syncthreads();
// Block wise reduction so that one thread in each block holds sum of thread results

typedef cub::BlockReduce<float, BLOCK_SIZE, BLOCK_REDUCE_RAKING, BLOCK_SIZE> BlockReduce;

__shared__ typename BlockReduce::TempStorage temp_storage;

float aggregate = BlockReduce(temp_storage).Sum(E);

if (threadIdx.x == 0 && threadIdx.y == 0)
    blocksum[blockIdx.x*blockDim.y + blockIdx.y] = aggregate;

}

int main(void)
{
  frame  * state = (frame*)malloc(sizeof(frame));

  float *blocksum = (float*)malloc(GRID_SIZE*GRID_SIZE*sizeof(float));

  state->natm = nat; //inicializando o numero de atomos;

  char name[] = "estado1";
  strcpy(state->title,name);

  for (int i = 0; i < nat; i++) {
    state->conf[i][0] = i;
    state->conf[i][1] = i;
    state->conf[i][2] = i;
  }

  frame * d_state;
  float *d_blocksum;

  cudaMalloc((void**)&d_state, sizeof(frame));

  cudaMalloc((void**)&d_blocksum, ((GRID_SIZE*GRID_SIZE)*sizeof(float)));

  cudaMemcpy(d_state, state, sizeof(frame),cudaMemcpyHostToDevice);


  dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
  dim3 gridBlock(GRID_SIZE,GRID_SIZE);

  add<<<gridBlock,dimBlock>>>(d_state, 3000, 15, d_blocksum);

  cudaError_t status =  cudaMemcpy(blocksum, d_blocksum, ((GRID_SIZE*GRID_SIZE)*sizeof(float)),cudaMemcpyDeviceToHost);

  float Etotal = 0.0;
  for (int k = 0; k < GRID_SIZE*GRID_SIZE; k++){
       Etotal += blocksum[k];
  }
 cout << endl << "energy: " << Etotal << endl;

  if (cudaSuccess != status)
  {
    cout << cudaGetErrorString(status) << endl;
  }

 // Free memory
  cudaFree(d_state);
  cudaFree(d_blocksum);

  return cudaThreadExit();
}

What is happening is that if the value of GRID_SIZE is the same as BLOCK_SIZE , as written above. 正在发生的事情是，如果价值GRID_SIZE是一样的BLOCK_SIZE ，因为上面写的。 The calculation is correct. 计算正确。 But if I change the value of GRID_SIZE , the result goes wrong. 但是，如果我更改GRID_SIZE的值，结果将出错。 Which leads me to think that the error is in this code: 这使我认为该代码中包含错误：

blocksum[blockIdx.x*blockDim.y + blockIdx.y] = aggregate;

The idea here is to return a 1D array, which contains the sum of each block. 这里的想法是返回一个一维数组，其中包含每个块的总和。

I do not intend to change the BLOCK_SIZE value, but the value of GRID_SIZE depends on the system I'm looking at, I intend to use values greater than 32 (always multiples of that). 我不打算改变BLOCK_SIZE价值，但价值GRID_SIZE取决于我期待在系统上，我打算用值大于32（的总是倍数）。

I looked for some example that use 2D grid with CUB, but did not find. 我寻找了一些将2D网格与CUB一起使用的示例，但没有找到。

I really new in CUDA program, maybe I'm making a mistake. 我真的是CUDA程序的新手，也许我在弄错。

edit : I put the complete code. 编辑：我把完整的代码。 For comparison, when I calculate these exact values for a serial program, it gives me energy: -297,121 为了进行比较，当我为串行程序计算这些精确值时，它给了我能量：-297,121

Answer 1

Probably the main issue is that your output indexing is not correct. 可能的主要问题是您的输出索引不正确。 Here's a reduced version of your code demonstrating correct results for arbitrary GRID_SIZE : 这是代码的简化版本，展示了任意GRID_SIZE正确结果：

$ cat t1360.cu
#include <stdio.h>
#include <cub/cub.cuh>
#define BLOCK_SIZE 32
#define GRID_SIZE 25
__global__
void add(float* blocksum)
{
   float E = 1.0;
  // Block wise reduction so that one thread in each block holds sum of thread results
    typedef cub::BlockReduce<float, BLOCK_SIZE, cub::BLOCK_REDUCE_RAKING, BLOCK_SIZE> BlockReduce;

    __shared__ typename BlockReduce::TempStorage temp_storage;
    float aggregate = BlockReduce(temp_storage).Sum(E);
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0)
        blocksum[blockIdx.y*gridDim.x + blockIdx.x] = aggregate;
}

int main(){

  float *d_result, *h_result;
  h_result = (float *)malloc(GRID_SIZE*GRID_SIZE*sizeof(float));
  cudaMalloc(&d_result, GRID_SIZE*GRID_SIZE*sizeof(float));
  dim3 grid  = dim3(GRID_SIZE,GRID_SIZE);
  dim3 block = dim3(BLOCK_SIZE, BLOCK_SIZE);
  add<<<grid, block>>>(d_result);
  cudaMemcpy(h_result, d_result, GRID_SIZE*GRID_SIZE*sizeof(float), cudaMemcpyDeviceToHost);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {printf("cuda error: %s\n", cudaGetErrorString(err)); return -1;}
  float result = 0;
  for (int i = 0; i < GRID_SIZE*GRID_SIZE; i++) result += h_result[i];
  if (result != (float)(GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE)) printf("mismatch, should be: %f, was: %f\n", (float)(GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE), result);
  else printf("Success\n");
  return 0;
}

$ nvcc -o t1360 t1360.cu
$ ./t1360
Success
$

The important change I made to your kernel code was in the output indexing: 我对内核代码进行的重要更改是在输出索引中：

blocksum[blockIdx.y*gridDim.x + blockIdx.x] = aggregate;

We want a simulated 2D index into an array that has width and height of GRID_SIZE consisting of one float quantity per point. 我们希望将2D索引模拟成一个数组，该数组的宽度和高度为GRID_SIZE ，每个点包含一个float 。 Therefore the width of this array is given by gridDim.x (not blockDim ). 因此，此数组的宽度由gridDim.x （而不是blockDim ）给定。 The gridDim variable gives the dimensions of the grid in terms of blocks - and this lines up exactly with how our results array is set up. gridDim变量以块为单位给出网格的尺寸-这与结果数组的设置方式完全一致。

Your posted code will fail if GRID_SIZE and BLOCK_SIZE are different (for example, if GRID_SIZE were smaller than BLOCK_SIZE , cuda-memcheck will show illegal accesses, and if GRID_SIZE is larger than BLOCK_SIZE then this indexing error will result in blocks overwriting each other's values in the output array) because of this mixup between blockDim and gridDim . 如果GRID_SIZE和BLOCK_SIZE不同（例如，如果GRID_SIZE小于BLOCK_SIZE ，则cuda-memcheck将显示非法访问，并且GRID_SIZE大于BLOCK_SIZE则您发布的代码将失败，此索引错误将导致块互相覆盖对方的值输出数组），因为blockDim和gridDim之间blockDim这种混合。

Also note that float operations typically only have around 5 decimal digits of precision. 另请注意， float运算通常仅具有约5个十进制数字的精度。 So small differences in the 5th or 6th decimal place may be attributable to order of operations differences when doing floating-point arithmetic . 因此，在进行小数点运算时，第5位或第6位小数位的差异可能归因于运算顺序的差异。 You can prove this to yourself by switching to double arithmetic. 您可以通过切换到double精度算术来证明这一点。

使用2D块网格减少CUB

问题描述

1 个解决方案

解决方案1
0 已采纳 2018-06-02 00:32:03

使用2D块网格减少CUB

问题描述

1 个解决方案

解决方案1 0 已采纳 2018-06-02 00:32:03

解决方案1
0 已采纳 2018-06-02 00:32:03