简体   繁体   English

使用2D块网格减少CUB

[英]CUB reduction using 2D grid of blocks

I'm trying to make a sum using the CUB reduction method. 我正在尝试使用CUB减少方法求和。

The big problem is: I'm not sure how to return the values of each block to the Host when using 2-dimensional grids. 最大的问题是:使用二维网格时,我不确定如何将每个块的值返回给主机。

#include <iostream>
#include <math.h>
#include <cub/block/block_reduce.cuh>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <iomanip>

#define nat 1024
#define BLOCK_SIZE 32
#define GRID_SIZE 32

struct frame
{
   int  natm;
   char  title[100];
   float conf[nat][3];
};

using namespace std;
using namespace cub;

__global__
void add(frame* s, float L, float rc, float* blocksum)
{
int i = blockDim.x*blockIdx.x + threadIdx.x;
int j = blockDim.y*blockIdx.y + threadIdx.y;

float E=0.0, rij, dx, dy, dz;

// Your calculations first so that each thread holds its result
  dx = fabs(s->conf[j][0] - s->conf[i][0]);
  dy = fabs(s->conf[j][1] - s->conf[i][1]);
  dz = fabs(s->conf[j][2] - s->conf[i][2]);
  dx = dx - round(dx/L)*L;
  dy = dy - round(dy/L)*L;
  dz = dz - round(dz/L)*L;

   rij = sqrt(dx*dx + dy*dy + dz*dz);

  if ((rij <= rc) && (rij > 0.0))
    {E =  (4*((1/pow(rij,12))-(1/pow(rij,6))));}

//  E = 1.0;
__syncthreads();
// Block wise reduction so that one thread in each block holds sum of thread results

typedef cub::BlockReduce<float, BLOCK_SIZE, BLOCK_REDUCE_RAKING, BLOCK_SIZE> BlockReduce;

__shared__ typename BlockReduce::TempStorage temp_storage;

float aggregate = BlockReduce(temp_storage).Sum(E);

if (threadIdx.x == 0 && threadIdx.y == 0)
    blocksum[blockIdx.x*blockDim.y + blockIdx.y] = aggregate;

}

int main(void)
{
  frame  * state = (frame*)malloc(sizeof(frame));

  float *blocksum = (float*)malloc(GRID_SIZE*GRID_SIZE*sizeof(float));

  state->natm = nat; //inicializando o numero de atomos;

  char name[] = "estado1";
  strcpy(state->title,name);

  for (int i = 0; i < nat; i++) {
    state->conf[i][0] = i;
    state->conf[i][1] = i;
    state->conf[i][2] = i;
  }

  frame * d_state;
  float *d_blocksum;

  cudaMalloc((void**)&d_state, sizeof(frame));

  cudaMalloc((void**)&d_blocksum, ((GRID_SIZE*GRID_SIZE)*sizeof(float)));

  cudaMemcpy(d_state, state, sizeof(frame),cudaMemcpyHostToDevice);


  dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
  dim3 gridBlock(GRID_SIZE,GRID_SIZE);

  add<<<gridBlock,dimBlock>>>(d_state, 3000, 15, d_blocksum);

  cudaError_t status =  cudaMemcpy(blocksum, d_blocksum, ((GRID_SIZE*GRID_SIZE)*sizeof(float)),cudaMemcpyDeviceToHost);

  float Etotal = 0.0;
  for (int k = 0; k < GRID_SIZE*GRID_SIZE; k++){
       Etotal += blocksum[k];
  }
 cout << endl << "energy: " << Etotal << endl;

  if (cudaSuccess != status)
  {
    cout << cudaGetErrorString(status) << endl;
  }

 // Free memory
  cudaFree(d_state);
  cudaFree(d_blocksum);

  return cudaThreadExit();
}

What is happening is that if the value of GRID_SIZE is the same as BLOCK_SIZE , as written above. 正在发生的事情是,如果价值GRID_SIZE是一样的BLOCK_SIZE ,因为上面写的。 The calculation is correct. 计算正确。 But if I change the value of GRID_SIZE , the result goes wrong. 但是,如果我更改GRID_SIZE的值,结果将出错。 Which leads me to think that the error is in this code: 这使我认为该代码中包含错误:

blocksum[blockIdx.x*blockDim.y + blockIdx.y] = aggregate;

The idea here is to return a 1D array, which contains the sum of each block. 这里的想法是返回一个一维数组,其中包含每个块的总和。

I do not intend to change the BLOCK_SIZE value, but the value of GRID_SIZE depends on the system I'm looking at, I intend to use values greater than 32 (always multiples of that). 我不打算改变BLOCK_SIZE价值,但价值GRID_SIZE取决于我期待在系统上,我打算用值大于32(的总是倍数)。

I looked for some example that use 2D grid with CUB, but did not find. 我寻找了一些将2D网格与CUB一起使用的示例,但没有找到。

I really new in CUDA program, maybe I'm making a mistake. 我真的是CUDA程序的新手,也许我在弄错。

edit : I put the complete code. 编辑 :我把完整的代码。 For comparison, when I calculate these exact values for a serial program, it gives me energy: -297,121 为了进行比较,当我为串行程序计算这些精确值时,它给了我能量:-297,121

Probably the main issue is that your output indexing is not correct. 可能的主要问题是您的输出索引不正确。 Here's a reduced version of your code demonstrating correct results for arbitrary GRID_SIZE : 这是代码的简化版本,展示了任意GRID_SIZE正确结果:

$ cat t1360.cu
#include <stdio.h>
#include <cub/cub.cuh>
#define BLOCK_SIZE 32
#define GRID_SIZE 25
__global__
void add(float* blocksum)
{
   float E = 1.0;
  // Block wise reduction so that one thread in each block holds sum of thread results
    typedef cub::BlockReduce<float, BLOCK_SIZE, cub::BLOCK_REDUCE_RAKING, BLOCK_SIZE> BlockReduce;

    __shared__ typename BlockReduce::TempStorage temp_storage;
    float aggregate = BlockReduce(temp_storage).Sum(E);
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0)
        blocksum[blockIdx.y*gridDim.x + blockIdx.x] = aggregate;
}

int main(){

  float *d_result, *h_result;
  h_result = (float *)malloc(GRID_SIZE*GRID_SIZE*sizeof(float));
  cudaMalloc(&d_result, GRID_SIZE*GRID_SIZE*sizeof(float));
  dim3 grid  = dim3(GRID_SIZE,GRID_SIZE);
  dim3 block = dim3(BLOCK_SIZE, BLOCK_SIZE);
  add<<<grid, block>>>(d_result);
  cudaMemcpy(h_result, d_result, GRID_SIZE*GRID_SIZE*sizeof(float), cudaMemcpyDeviceToHost);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {printf("cuda error: %s\n", cudaGetErrorString(err)); return -1;}
  float result = 0;
  for (int i = 0; i < GRID_SIZE*GRID_SIZE; i++) result += h_result[i];
  if (result != (float)(GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE)) printf("mismatch, should be: %f, was: %f\n", (float)(GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE), result);
  else printf("Success\n");
  return 0;
}

$ nvcc -o t1360 t1360.cu
$ ./t1360
Success
$

The important change I made to your kernel code was in the output indexing: 我对内核代码进行的重要更改是在输出索引中:

blocksum[blockIdx.y*gridDim.x + blockIdx.x] = aggregate;

We want a simulated 2D index into an array that has width and height of GRID_SIZE consisting of one float quantity per point. 我们希望将2D索引模拟成一个数组,该数组的宽度和高度为GRID_SIZE ,每个点包含一个float Therefore the width of this array is given by gridDim.x (not blockDim ). 因此,此数组的宽度由gridDim.x (而不是blockDim )给定。 The gridDim variable gives the dimensions of the grid in terms of blocks - and this lines up exactly with how our results array is set up. gridDim变量以块为单位给出网格的尺寸-这与结果数组的设置方式完全一致。

Your posted code will fail if GRID_SIZE and BLOCK_SIZE are different (for example, if GRID_SIZE were smaller than BLOCK_SIZE , cuda-memcheck will show illegal accesses, and if GRID_SIZE is larger than BLOCK_SIZE then this indexing error will result in blocks overwriting each other's values in the output array) because of this mixup between blockDim and gridDim . 如果GRID_SIZEBLOCK_SIZE不同(例如,如果GRID_SIZE小于BLOCK_SIZE ,则cuda-memcheck将显示非法访问,并且GRID_SIZE大于BLOCK_SIZE则您发布的代码将失败,此索引错误将导致块互相覆盖对方的值输出数组),因为blockDimgridDim之间blockDim这种混合。

Also note that float operations typically only have around 5 decimal digits of precision. 另请注意, float运算通常仅具有约5个十进制数字的精度。 So small differences in the 5th or 6th decimal place may be attributable to order of operations differences when doing floating-point arithmetic . 因此,在进行小数点运算时 ,第5位或第6位小数位的差异可能归因于运算顺序的差异 You can prove this to yourself by switching to double arithmetic. 您可以通过切换到double精度算术来证明这一点。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM