
Looping over 3 dimensional arrays in CUDA to sum their elements

I'm having some problems understanding how to loop over 3-dimensional arrays with a kernel.

This is the code I have so far:

#include <iostream>
#include <cstdlib>   // srand, rand, system
#include <ctime>

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

using namespace std;


int main()
{
// Array properties
const int width = 1;
const int height = 1;
const int depth = 1;

// Declaration of arrays
float h_A[width][height][depth];
float h_B[width][height][depth];
float h_C[width][height][depth] = {{{0}}};

// Fill up arrays
srand(time(0));
for(int i = 0; i < width; i++){
    for(int j = 0; j < height; j++){
        for(int z = 0; z < depth; z++){
            h_A[i][j][z] = rand()%1000;
            h_B[i][j][z] = rand()%1000;
        }
    }
}

// Declaration of device pointers
cudaPitchedPtr d_A, d_B, d_C;

// Allocating memory in GPU
cudaExtent extent = make_cudaExtent(width*sizeof(float),height,depth);
cudaMalloc3D(&d_A, extent);
cudaMalloc3D(&d_B, extent);
cudaMalloc3D(&d_C, extent);

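// Copying memory from host to device
// Zero-initialize: the unused fields (srcArray, dstArray, srcPos, dstPos) must be 0
cudaMemcpy3DParms p = {0};
p.extent = extent;
p.kind = cudaMemcpyHostToDevice;

// Each device array needs its own host source; xsize/ysize are width and height
p.srcPtr = make_cudaPitchedPtr(h_A, sizeof(float)*width, width, height);
p.dstPtr = d_A;
cudaMemcpy3D(&p);
p.srcPtr = make_cudaPitchedPtr(h_B, sizeof(float)*width, width, height);
p.dstPtr = d_B;
cudaMemcpy3D(&p);
p.srcPtr = make_cudaPitchedPtr(h_C, sizeof(float)*width, width, height);
p.dstPtr = d_C;
cudaMemcpy3D(&p);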

system("pause");
return 0;
}

How do I make a kernel that loops over each element in the arrays and adds them together?

There is an example on page 21 of the CUDA 4.0 programming guide for looping over a 2D array of floats:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);


// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        // pitch is in bytes, so step through rows with char* arithmetic
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}
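
The same row-pointer arithmetic carries over to the 3D pitched allocations from the question: a z-slice is `pitch * height` bytes, so row `y` of slice `z` starts at `ptr + z * pitch * height + y * pitch`. The following is a minimal sketch of an element-wise add under some assumptions, not a definitive implementation. The names `rowPtr`, `add3D`, `copy3D`, `addOnGpu` and `CUDA_CHECK` are made up for illustration, one thread handles one element, and the host data is a packed buffer in which `x` (the width) is the fastest-varying index. Note that the question's `h_A[width][height][depth]` makes the last index (depth) the contiguous one, so the extent would have to be ordered to match that layout.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Illustrative error-check macro (not part of the CUDA API).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t e = (call);                                           \
        if (e != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));   \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// Start of row y in slice z of a pitched 3D allocation (pitch is in bytes).
__device__ float* rowPtr(cudaPitchedPtr p, int y, int z, int height)
{
    size_t slicePitch = p.pitch * height;
    return (float*)((char*)p.ptr + z * slicePitch + y * p.pitch);
}

// One thread per element: C(x,y,z) = A(x,y,z) + B(x,y,z).
__global__ void add3D(cudaPitchedPtr A, cudaPitchedPtr B, cudaPitchedPtr C,
                      int width, int height, int depth)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= width || y >= height || z >= depth) return;  // grid may overshoot
    rowPtr(C, y, z, height)[x] =
        rowPtr(A, y, z, height)[x] + rowPtr(B, y, z, height)[x];
}

// Copy between a packed host buffer and a pitched device allocation.
static void copy3D(cudaPitchedPtr dst, cudaPitchedPtr src,
                   cudaExtent extent, cudaMemcpyKind kind)
{
    cudaMemcpy3DParms p = {0};  // unused fields must be zero
    p.srcPtr = src;
    p.dstPtr = dst;
    p.extent = extent;
    p.kind = kind;
    CUDA_CHECK(cudaMemcpy3D(&p));
}

// Adds two packed arrays of width*height*depth floats on the GPU.
std::vector<float> addOnGpu(const std::vector<float>& a,
                            const std::vector<float>& b,
                            int width, int height, int depth)
{
    std::vector<float> c(a.size());
    size_t rowBytes = width * sizeof(float);
    cudaExtent extent = make_cudaExtent(rowBytes, height, depth);

    cudaPitchedPtr dA, dB, dC;
    CUDA_CHECK(cudaMalloc3D(&dA, extent));
    CUDA_CHECK(cudaMalloc3D(&dB, extent));
    CUDA_CHECK(cudaMalloc3D(&dC, extent));

    copy3D(dA, make_cudaPitchedPtr((void*)a.data(), rowBytes, width, height),
           extent, cudaMemcpyHostToDevice);
    copy3D(dB, make_cudaPitchedPtr((void*)b.data(), rowBytes, width, height),
           extent, cudaMemcpyHostToDevice);

    dim3 block(8, 8, 4);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y,
              (depth + block.z - 1) / block.z);
    add3D<<<grid, block>>>(dA, dB, dC, width, height, depth);
    CUDA_CHECK(cudaGetLastError());

    copy3D(make_cudaPitchedPtr(c.data(), rowBytes, width, height), dC,
           extent, cudaMemcpyDeviceToHost);

    CUDA_CHECK(cudaFree(dA.ptr));
    CUDA_CHECK(cudaFree(dB.ptr));
    CUDA_CHECK(cudaFree(dC.ptr));
    return c;
}

int main()
{
    const int width = 5, height = 4, depth = 3;
    std::vector<float> a(width * height * depth), b(a.size());
    for (size_t i = 0; i < a.size(); ++i) { a[i] = float(i); b[i] = 2.0f * i; }

    std::vector<float> c = addOnGpu(a, b, width, height, depth);

    bool ok = true;
    for (size_t i = 0; i < c.size(); ++i) ok = ok && (c[i] == a[i] + b[i]);
    printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```

Unlike the programming guide example, where every thread walks the whole array, this sketch maps the thread grid onto the array so that each element is touched exactly once.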

Rewriting the 2D example above to sum the elements should be easy. Additionally, you can refer to this thread. When efficiency is a concern, you might also look at the parallel reduction approach in CUDA. It is used, for example, when implementing Monte Carlo simulation (see the Multi Monte Carlo example).
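
For readers who want a single total rather than an element-wise result, the parallel reduction idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the Monte Carlo sample: the data is assumed to sit in a linear `cudaMalloc` buffer rather than a pitched one, the names `blockSum`, `sumOnGpu` and `kBlockSize` are invented here, and the per-block partial sums are added up on the host for simplicity. Error checking is omitted to keep the sketch short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Threads per block; must be a power of two for the tree reduction below.
constexpr int kBlockSize = 256;

// Each block writes the sum of its share of the input to partial[blockIdx.x].
__global__ void blockSum(const float* in, float* partial, size_t n)
{
    __shared__ float cache[kBlockSize];

    // Grid-stride loop: each thread accumulates a private running sum.
    float acc = 0.0f;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += in[i];
    cache[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}

// Sums a host array on the GPU; the final pass over the partials runs on the host.
float sumOnGpu(const std::vector<float>& h)
{
    const int blocks = 64;
    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, h.size() * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlockSize>>>(dIn, dPartial, h.size());

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    double total = 0.0;  // accumulate the few partials in double on the host
    for (float p : partial) total += p;
    return (float)total;
}

int main()
{
    std::vector<float> h(1 << 20, 1.0f);
    printf("sum = %.1f\n", sumOnGpu(h));
    return 0;
}
```

A pitched 3D allocation could be reduced the same way by replacing the flat index `i` with the row-pointer arithmetic from the 2D example, at the cost of a slightly more involved loop.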

