CUDA matrix copy program is very slow

Here's my CUDA code:

#include <stdio.h>
#include <assert.h>

void verify(float *A, float *B, int size);

__global__ void CopyData(float *d_array, float *d_dest_array, size_t pitch, int cols, int rows)
{
    for (int i = 0; i < rows; i++) {
        float *rowData = (float*)(((char*)d_array) + (i * pitch));
        for (int j = 0; j < cols; j++) {
            d_dest_array[i * cols + j] = *(rowData + j);
        }
    }
}

int main(int argc, char **argv)
{
    int row, col, i, j;
    float time1, time2;
    float *d_array;      // device array that memory will be allocated to
    float *d_dest_array; // device array that will be a copy
    size_t pitch;        // ensures correct data structure alignment

    if (argc != 3)
    {
        printf("Usage: %s [row] [col]\n", argv[0]);
        return 1;
    }

    row = atoi(argv[1]);
    col = atoi(argv[2]);
    float *h1_array = new float[col*row];
    float *h2_array = new float[col*row];
    float *h_ori_array = new float[col*row];
    for (i = 0; i < row; i++) {
        for (j = 0; j < col; j++) {
            h_ori_array[i*col+j] = i*col + j;
        }
    }
    cudaEvent_t start, stop;

    cudaMallocPitch(&d_array, &pitch, col*sizeof(float), row);
    cudaMalloc(&d_dest_array, col*row*sizeof(float));
    cudaMemcpy2D(d_array, pitch, h_ori_array, col*sizeof(float), col*sizeof(float), row, cudaMemcpyHostToDevice);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    //CopyData<<<100, 512>>>(d_array, d_dest_array, pitch, col, row);
    for (i = 0; i < row; i++) {
        for (j = 0; j < col; j++) {
            h1_array[i*col+j] = h_ori_array[i*col+j];
        }
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time1, start, stop);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    CopyData<<<row*col/512, 512>>>(d_array, d_dest_array, pitch, col, row);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time2, start, stop);

    cudaMemcpy2D(h2_array, pitch, d_dest_array, col*sizeof(float), col*sizeof(float), row, cudaMemcpyDeviceToHost);

    verify(h1_array, h2_array, row*col);

    free(h1_array); free(h2_array); free(h_ori_array);
    cudaFree(d_array); cudaFree(d_dest_array);
    printf("Exec time in ser = %f, par = %f ms with pitch %d", time1, time2, (int)pitch);

    return 0;
}

void verify(float *A, float *B, int size)
{
    for (int i = 0; i < size; i++)
    {
        assert(A[i] == B[i]);
    }
    printf("Correct!");
}

It just makes a copy of a matrix. Both a serial and a parallel version are written so that I can compare them.

It gives the wrong answer if the array size is 64; for sizes of 256 and beyond, the answer is correct. However, it takes far too long: 4 seconds for a 512x512 matrix.

I am not comfortable with cudaMemcpy2D. Can someone please pinpoint what I am doing wrong? Any suggestions regarding CUDA coding practices would also be appreciated. Also, when calling a kernel, how do I decide the block and grid dimensions?

EDIT 1: The CopyData function I used does not use parallelism. I foolishly copied it from VIHARRI's answer at the bottom of the page.

The selected answer there does not specify how the data was copied from host to device. Can someone show how it can be done using the cudaMallocPitch and cudaMemcpy2D functions? I am looking for the correct way to index inside the kernel, as well as the correct way to copy a 2D array from host to device.

You're only running a single CUDA thread. (Actually, on closer inspection, you are running the same code in multiple threads, but the result is the same: you're not really exploiting the GPU hardware.)

Ideally you need to run hundreds or thousands of concurrent threads to get the best performance. One way to do this is to have one thread per output element, and then in each thread use the grid, block, and thread IDs to determine which output element to process. Look at the examples in the CUDA SDK to understand the general pattern for parallel processing with CUDA.
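The one-thread-per-element pattern could be sketched like this for the copy in the question (a minimal, untested sketch; `CopyDataParallel` and the 16x16 block shape are illustrative choices, not the poster's code):

```cuda
// One thread per output element: each thread derives its own (r, c)
// coordinates from the block and thread IDs, reads one element from the
// pitched source array, and writes it to the dense destination array.
__global__ void CopyDataParallel(const float *d_array, float *d_dest_array,
                                 size_t pitch, int cols, int rows)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        // Pitched addressing: each row starts `pitch` BYTES after the
        // previous one, so the offset is computed on a char pointer.
        const float *rowData = (const float *)((const char *)d_array + r * pitch);
        d_dest_array[r * cols + c] = rowData[c];
    }
}

// Host side: round the grid dimensions up so every element is covered
// even when rows/cols are not multiples of the block shape.
//
//     dim3 block(16, 16);
//     dim3 grid((cols + block.x - 1) / block.x,
//               (rows + block.y - 1) / block.y);
//     CopyDataParallel<<<grid, block>>>(d_array, d_dest_array, pitch, cols, rows);
```

Note that since `d_dest_array` is a dense (unpitched) allocation, it can be copied back to the host with a plain `cudaMemcpy`; if `cudaMemcpy2D` is used instead, both the source and destination pitch should be `cols * sizeof(float)` in that direction. Passing the padded `pitch` value for a dense array, as the question's device-to-host copy does, is a likely source of the wrong results at small sizes, where the pitch exceeds the row width.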
