Iterate on 1D array using CUDA in multi-GPU system

I've been studying parallel programming for the last couple of months and I am now trying to adapt my application to a multi-GPU platform. The problem is that I still do not understand very well how to iterate through an array using multiple GPUs.

Do I need to divide my main array into smaller sub-arrays and send one to each GPU, or is there a way to make each GPU iterate over just a fragment of the array? I have the serial and single-GPU versions of this application working, and I have tried different methods to adapt it to multiple GPUs, but none of them returns the same results as the two previous versions. I do not know what more I can do, so my conclusion is that I am not understanding how to iterate through the array on a multi-GPU system. Can someone help me, please?

My code runs N iterations, and in each iteration it goes through every value in my array (which represents a grid) and calculates a new value for it.

This is a sketch of what my code looks like right now:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <math.h> // ceil
#include <omp.h>  // omp_set_num_threads, omp_get_thread_num

#define DIM     24
#define BLOCK_SIZE 16
#define SRAND_VALUE 585

__global__ void random(int* t, int* newT){

    int iy = blockDim.y * blockIdx.y + threadIdx.y + 1;
    int ix = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int id = iy * (DIM+2) + ix;

    if (iy <= DIM && ix <= DIM) {
        if (t[id] % 2 == 0)
            newT[id] = t[id]*3;
        else
            newT[id] = t[id]*5;
    }
}

int main(int argc, char* argv[]){
    int i, j, iter, devCount;
    int *h_test, *d_test, *d_tempTest, *d_newTest;
    size_t gridBytes;

    cudaGetDeviceCount(&devCount);

    gridBytes = sizeof(int)*(DIM+2)*(DIM+2); // grid plus one halo cell on each side, matching the kernel's indexing
    h_test = (int*)malloc(gridBytes);

    srand(SRAND_VALUE);
    #pragma omp parallel for private(i,j)
    for(i = 1; i <= DIM; i++) {
        for(j = 1; j <= DIM; j++) {
            h_test[i*(DIM+2)+j] = rand() % 2;
        }
    }

    if (devCount == 0){
        printf("There are no devices in this machine!");
        return 1; // if there is no GPU, then break the code
    }

    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
    int  linGrid = (int)ceil(DIM/(float)BLOCK_SIZE);
    dim3 gridSize(linGrid,linGrid,1);

    dim3 cpyBlockSize(BLOCK_SIZE,1,1);
    dim3 cpyGridRowsGridSize((int)ceil(DIM/(float)cpyBlockSize.x),1,1);
    dim3 cpyGridColsGridSize((int)ceil((DIM+2)/(float)cpyBlockSize.x),1,1);

    if (devCount == 1){

        cudaMalloc(&d_test, gridBytes);
        cudaMalloc(&d_tempTest, gridBytes);
        cudaMalloc(&d_newTest, gridBytes);

        cudaMemcpy(d_test, h_test, gridBytes, cudaMemcpyHostToDevice);

        for (iter = 0; iter < DIM; iter ++){
            random<<<gridSize, blockSize>>>(d_test, d_newTest);

            d_tempTest = d_test;
            d_test = d_newTest;
            d_newTest = d_tempTest;
        }

        cudaMemcpy(h_test, d_test, gridBytes, cudaMemcpyDeviceToHost);

        return 0;
    }

    else{
        int nThreads, tId, current;
        omp_set_num_threads(devCount);

        for (iter = 0; iter < DIM; iter ++){

            #pragma omp parallel private(tId) shared(h_test, gridBytes)
            {
                tId = omp_get_thread_num();
                cudaSetDevice(tId);

                cudaMalloc(&d_test, gridBytes);
                cudaMalloc(&d_tempTest, gridBytes);
                cudaMalloc(&d_newTest, gridBytes);

                cudaMemcpy(d_test, h_test, gridBytes, cudaMemcpyHostToDevice);

                // ****** What do I do here? ******

            } 
        }
        return 0;
    }
}

Thanks in advance.

The short answer: yes, you should divide your array into sub-arrays, one for each GPU.

Details: Each GPU has its own memory. In your code you allocate memory for the whole array on each GPU and copy the whole array to each GPU. You could then operate on a subset of the array, but when you copy back you would need to ensure that you copy only the parts each GPU actually updated. The better way from the start is to copy to each GPU only the part of the array that it is going to update.

Solution: Modify the multi-GPU part to something like the following (you need to make sure you don't miss elements if the element count is not divisible by devCount; my code snippet does not check this):

int elemsPerGPU = (DIM+2)*(DIM+2)/devCount;   // offset in elements, not bytes
size_t gridBytesPerGPU = elemsPerGPU * sizeof(int);

cudaMalloc(&d_test, gridBytesPerGPU);
cudaMalloc(&d_newTest, gridBytesPerGPU);

// copy only the part of the array that you want to use on that GPU
cudaMemcpy(d_test, &h_test[tId*elemsPerGPU], gridBytesPerGPU, cudaMemcpyHostToDevice);
// do the calculation
cudaMemcpy(&h_test[tId*elemsPerGPU], d_newTest, gridBytesPerGPU, cudaMemcpyDeviceToHost);

Now you only need to calculate the appropriate block and grid size; see (c) below. If you have problems with that part then please ask in a comment and I will extend this answer.
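Putting the pieces together, the parallel region inside your iteration loop could look roughly like the untested sketch below. It assumes a 1D kernel as suggested in (c) below, which I call randomChunk here (a hypothetical name, not from your code); d_chunk, d_newChunk, and the choice of 256 threads per block are also mine, and any remainder when the element count is not divisible by devCount is still ignored:

#pragma omp parallel private(tId)
{
    int *d_chunk, *d_newChunk;
    tId = omp_get_thread_num();
    cudaSetDevice(tId);                           // bind this host thread to one GPU

    cudaMalloc(&d_chunk, gridBytesPerGPU);
    cudaMalloc(&d_newChunk, gridBytesPerGPU);

    // copy only this GPU's chunk of the host array
    cudaMemcpy(d_chunk, &h_test[tId*elemsPerGPU], gridBytesPerGPU, cudaMemcpyHostToDevice);

    // one thread per element of the chunk
    int threads = 256;
    int blocks = (elemsPerGPU + threads - 1)/threads;
    randomChunk<<<blocks, threads>>>(d_chunk, d_newChunk, elemsPerGPU);

    // copy the updated chunk back to its place in the host array
    cudaMemcpy(&h_test[tId*elemsPerGPU], d_newChunk, gridBytesPerGPU, cudaMemcpyDeviceToHost);

    cudaFree(d_chunk);
    cudaFree(d_newChunk);
}

In real code you would allocate the device buffers once outside the iteration loop and only repeat the copies and the kernel launch; the sketch allocates inside only to stay close to the structure of your original.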

Apart from that, there are some parts of your code that I do not understand:

a) Why do you need to swap the pointers?

b) You run the kernel multiple times, but the code in the for loop does not depend on the counter. Why? What am I missing?

for (iter = 0; iter < DIM; iter ++){
    random<<<gridSize, blockSize>>>(d_test, d_newTest);

    d_tempTest = d_test;
    d_test = d_newTest;
    d_newTest = d_tempTest;
}

c) The calculation of grid and block size for this simple kernel looks a bit complicated (I skipped it when reading your question). I would treat the problem as one-dimensional; then everything becomes much simpler, including your kernel.
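For example, a 1D version of the kernel could look like the sketch below (this is the hypothetical randomChunk used in the snippet above; the naming is mine):

__global__ void randomChunk(int* t, int* newT, int n){
    int id = blockDim.x * blockIdx.x + threadIdx.x;   // flat 1D index
    if (id < n) {                                     // guard against the last, partial block
        if (t[id] % 2 == 0)
            newT[id] = t[id]*3;
        else
            newT[id] = t[id]*5;
    }
}

With this formulation the two-dimensional block/grid bookkeeping (blockSize, gridSize, and the cpy* sizes) disappears entirely; the launch configuration is just (elemsPerGPU + threads - 1)/threads blocks of threads threads, as in the sketch above.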
