Why is this code ten times slower on the GPU than on the CPU?

I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!

Here are the specifics of my test system:

  • OS: Windows 10
  • CPU: Core i7-4700MQ @ 2.40 GHz
  • GPU: GeForce GT 750M (compute capability 3.0)
  • CUDA SDK: v7.5

The code below performs calculations equivalent to those in my production code, on both the CPU and the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650 ms; GPU approx. 7 s).

I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the Visual Profiler; and I've tried integer data rather than doubles. Whatever I do, the GPU version is always significantly slower than the CPU equivalent.

So why is the GPU version so much slower, and what changes, beyond those I've already tried, could I make to improve its performance?

Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30

And here's the contents of source.cu:

#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>

void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
    for (size_t i = 0; i < arrayLength; i++)
    {
        double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
        array[i] = array[i] * adjustmentFactor;
    }
}

__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx < arrayLength)
    {
        double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
        array[idx] = array[idx] * adjustmentFactor;
    }
}

void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
    double *dev_row, *dev_curve;

    cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
    cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);

    cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);

    CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);

    cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);

    cudaFree(dev_curve);
    cudaFree(dev_row);
}

void FillArray(int length, double row[])
{
    for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}

int main(void)
{
    const int arrayLength = 10000;

    double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];

    FillArray(arrayLength, curve1);
    FillArray(arrayLength, curve2);

    ///////////////////////////////////// CPU Version ////////////////////////////////////////

    LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;

    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);

    for (size_t iterations = 0; iterations < 10000; iterations++)
    {
        FillArray(arrayLength, arrayForCPU);
        AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
    }

    QueryPerformanceCounter(&EndingTime);

    ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMilliseconds.QuadPart *= 1000;
    ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
    std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;

    ///////////////////////////////////// GPU Version ////////////////////////////////////////

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);

    for (size_t iterations = 0; iterations < 10000; iterations++)
    {
        FillArray(arrayLength, arrayForGPU);
        AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
    }

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);

    std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}

And here is an example of the output of CPUSpeedTest.exe:

Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76

What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others who, like me, are new to the technology.

The GPU code is ten times slower than the CPU equivalent because it exhibits a perfect storm of performance-wrecking characteristics.

The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very simple calculation (one that is extremely fast regardless of the processor it runs on), and then copying the data back from the device to the host.
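
To see where the time goes, here is a standalone sketch (not the production code; the kernel is a trivial stand-in for CudaKernel) that splits a single GPU pass into its phases and times each one with a host-side clock. The exact numbers are system-dependent, and the first CUDA call also absorbs a one-time context-creation cost:

#include <chrono>
#include <iostream>
#include <cuda_runtime_api.h>

__global__ void Touch(double* data, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) data[idx] *= 1.0;   // trivial stand-in for the real kernel
}

int main()
{
    const int n = 10000;
    double host[n];
    for (int i = 0; i < n; i++) host[i] = 0.1 + i;

    using Clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    double* dev = nullptr;
    auto t0 = Clock::now();
    cudaMalloc((void**)&dev, sizeof(double) * n);                       // allocation (first CUDA call also creates the context)
    auto t1 = Clock::now();
    cudaMemcpy(dev, host, sizeof(double) * n, cudaMemcpyHostToDevice);  // host-to-device copy
    auto t2 = Clock::now();
    Touch<<<100, 1000>>>(dev, n);                                       // kernel launch
    cudaDeviceSynchronize();                                            // wait for the kernel to finish
    auto t3 = Clock::now();
    cudaMemcpy(host, dev, sizeof(double) * n, cudaMemcpyDeviceToHost);  // device-to-host copy
    auto t4 = Clock::now();

    std::cout << "alloc:  " << ms(t1 - t0).count() << " ms\n"
              << "H2D:    " << ms(t2 - t1).count() << " ms\n"
              << "kernel: " << ms(t3 - t2).count() << " ms\n"
              << "D2H:    " << ms(t4 - t3).count() << " ms\n";

    cudaFree(dev);
    return 0;
}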

As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused, as sketched below. In the code above, this change alone brings the GPU-to-CPU runtime ratio down from 10:1 to 4:1.
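
Here is a minimal sketch of that allocate-once approach (the helpers InitGPUBuffers and FreeGPUBuffers are illustrative names, not part of the original code). The buffers are sized once for the largest expected array, reused on every call, and freed only when all processing is finished:

double *dev_row = nullptr, *dev_curve = nullptr;   // allocated once, reused across calls

void InitGPUBuffers(int maxLength)
{
    cudaMalloc((void**)&dev_row, sizeof(double) * maxLength);
    cudaMalloc((void**)&dev_curve, sizeof(double) * maxLength);
}

void AdjustArrayOnGPUReusingBuffers(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
    // Only the copies and the kernel run per call; no cudaMalloc/cudaFree here.
    cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);

    CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);

    cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
}

void FreeGPUBuffers()
{
    cudaFree(dev_curve);
    cudaFree(dev_row);
}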

The remaining performance disparity comes down to the fact that the calculation is so simple that the CPU can perform it serially millions of times in a very short span. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment to an array element. Something this simple must be performed millions of times before the benefit of doing it in parallel outweighs the unavoidable cost of transferring the data to the GPU and back. On my test system, a million array elements is the break-even point, where the GPU and CPU versions take (approximately) the same amount of time.
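
To locate the break-even point on your own hardware, a sweep over array sizes like the sketch below can help. The sizes swept are illustrative; the arrays are heap-allocated via std::vector so the larger sizes don't overflow the stack; and the grid is sized to cover all n elements, unlike the fixed <<<100, 1000>>> launch above, which only covers 100,000. It reuses AdjustArrayOnCPU, CudaKernel, and FillArray from the listing:

#include <chrono>
#include <iostream>
#include <vector>

void SweepBreakEven()
{
    for (int n : { 10000, 100000, 1000000, 10000000 })
    {
        std::vector<double> array(n), curve(n);
        FillArray(n, curve.data());
        FillArray(n, array.data());

        auto t0 = std::chrono::steady_clock::now();
        AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, array.data(), n, curve.data(), n);
        auto t1 = std::chrono::steady_clock::now();

        // GPU pass, including allocation and both transfers, as in the original code.
        FillArray(n, array.data());
        double *dev_row, *dev_curve;
        cudaMalloc((void**)&dev_row, sizeof(double) * n);
        cudaMalloc((void**)&dev_curve, sizeof(double) * n);
        cudaMemcpy(dev_row, array.data(), sizeof(double) * n, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_curve, curve.data(), sizeof(double) * n, cudaMemcpyHostToDevice);

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;   // enough blocks to cover all n elements
        CudaKernel<<<blocks, threads>>>(1.0, 1.0, 1.0, 1.0, dev_row, n, dev_curve, n);

        cudaMemcpy(array.data(), dev_row, sizeof(double) * n, cudaMemcpyDeviceToHost);   // also synchronizes
        cudaFree(dev_curve);
        cudaFree(dev_row);
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::duration<double, std::milli>;
        std::cout << n << " elements: CPU " << ms(t1 - t0).count()
                  << " ms, GPU " << ms(t2 - t1).count() << " ms" << std::endl;
    }
}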
