
CUDA performance depending on how variables are declared

Are there any tips for improving CUDA performance in a case like this, such as how variables are declared (global vs. local), how parameters are passed, or how memory is copied?

I'm trying to figure out why the performance of sum_gpu_FAST and sum_gpu_SLOW differs so much in the example below.

Here you can see the whole example code.

#include <iostream>
#include <chrono>
#include <cstdio>   // for printf in host code
#define N 10000000
__global__
void sum_gpu_FAST(int (&data)[N][2], int& sum, int n) {  // runtime : 2.42342s
    int s = 0;
    for (int i = 0; i < n; i++) 
        s += data[i][0] * 10 + data[i][1];
    sum = s;
}
__global__
void sum_gpu_SLOW(int (&data)[N][2], int& sum, int n) {  // runtime : 436.64ms
    sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i][0] * 10 + data[i][1];
    }
}
void sum_cpu(int (*data)[2], int& sum, int n) {
    for (int i = 0; i < n; i++) {
        sum +=  data[i][0] * 10 + data[i][1];
    }
}
int main()
{
    int (*v)[2] = new int[N][2];
    for (int i = 0; i < N; i++)
        v[i][0] = 1, v[i][1] = 3;
    printf ("-CPU------------------------------------------------\n");
    {
        int sum = 0;
        auto start = std::chrono::system_clock::now();
        sum_cpu(v, sum, N);
        auto end   = std::chrono::system_clock::now();
        // print output
        std::cout << sum << " / " << (end-start).count() / 1000000 << "ms" << std::endl;
    }
    printf ("-GPU-Ready------------------------------------------\n");
    int *dev_sum       = nullptr;
    int (*dev_v)[N][2] = nullptr;
    cudaMalloc((void **)&dev_v,   sizeof(int[N][2]));
    cudaMalloc((void **)&dev_sum, sizeof(int));
    cudaMemcpy(dev_v, v, sizeof(int[N][2]), cudaMemcpyHostToDevice);
    printf("-GPU-FAST-------------------------------------------\n");
    {
        int sum = 0;
        auto start = std::chrono::system_clock::now();
        sum_gpu_FAST<<<1, 1>>> (*dev_v, *dev_sum, N);
        cudaDeviceSynchronize(); // wait until end of kernel
        auto end   = std::chrono::system_clock::now();
        // print output
        cudaMemcpy( &sum, dev_sum, sizeof(int), cudaMemcpyDeviceToHost );
        std::cout << sum << " / " << (end-start).count() / 1000000 << "ms" << std::endl;
    }
    printf("-GPU-SLOW-------------------------------------------\n");
    {
        int sum = 0;
        auto start = std::chrono::system_clock::now();
        sum_gpu_SLOW<<<1, 1>>> (*dev_v, *dev_sum, N);
        cudaDeviceSynchronize(); // wait until end of kernel
        auto end   = std::chrono::system_clock::now();
        // print output
        cudaMemcpy( &sum, dev_sum, sizeof(int), cudaMemcpyDeviceToHost );
        std::cout << sum << " / " << (end-start).count() / 1000000 << "ms" << std::endl;
    }
    printf("----------------------------------------------------\n");
    return 0;
}



I'm trying to figure out why the performance of sum_gpu_FAST and sum_gpu_SLOW differs so much in the example below.

In the fast case, you are creating a local variable which is contained (presumably) in a register:

int s = 0;

During the loop iterations, reads are occurring from global memory, but the only write operation is to a register:

for (int i = 0; i < n; i++) 
    s += data[i][0] * 10 + data[i][1];

In the slow case, the running sum is contained in a variable resident in global memory:

sum = 0;

therefore, at each loop iteration, the updated value is written to global memory:

for (int i = 0; i < n; i++) {
    sum += data[i][0] * 10 + data[i][1];
}

Therefore the loop has additional overhead to write to global memory at each iteration, which is slower than maintaining the sum in a register.
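
As an illustration of the takeaway here, the slow kernel can be restructured so that the running sum stays in a local variable (a register) and global memory is written only once, which is essentially what sum_gpu_FAST already does. A minimal sketch, with a hypothetical kernel name and no timings attached:

__global__
void sum_gpu_REGISTER(int (&data)[N][2], int& sum, int n) {
    int s = 0;                               // running sum kept in a local variable (register)
    for (int i = 0; i < n; i++)
        s += data[i][0] * 10 + data[i][1];   // loop body only reads global memory
    sum = s;                                 // single store to global memory, after the loop
}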

I'm not going to completely dissect the SASS code to compare these two cases, because the compiler is making other decisions in the fast case around loop unrolling and possibly other factors, but my guess is that the lack of a need to store results to global memory during the loop iterations considerably assists with loop unrolling as well. However, we can make a simple deduction based on the tail end of the SASS code for each case:

                Function : _Z12sum_gpu_FASTRA10000000_A2_iRii
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                        /* 0x00000a0000017a02 */
                                                                                 /* 0x000fd00000000f00 */
...
        /*0b00*/                   STG.E.SYS [R2], R20 ;                         /* 0x0000001402007386 */
                                                                                 /* 0x000fe2000010e900 */
        /*0b10*/                   EXIT ;                                        /* 0x000000000000794d */
                                                                                 /* 0x000fea0003800000 */

In the fast case above, we see that there is a single global store (STG) instruction at the end of the kernel, right before the return statement (EXIT), and outside of any loops in the kernel. Although I haven't shown it all, there are indeed no other STG instructions in the fast kernel, except the one at the end. We see a different story looking at the tail end of the slow kernel:

        code for sm_70
                Function : _Z12sum_gpu_SLOWRA10000000_A2_iRii
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;       /* 0x00000a00ff017624 */
                                                                                 /* 0x000fd000078e00ff */
...
        /*0460*/                   STG.E.SYS [R2], R7 ;                          /* 0x0000000702007386 */
                                                                                 /* 0x0005e2000010e900 */
        /*0470*/              @!P0 BRA 0x2f0 ;                                   /* 0xfffffe7000008947 */
                                                                                 /* 0x000fea000383ffff */
        /*0480*/                   EXIT ;                                        /* 0x000000000000794d */
                                                                                 /* 0x000fea0003800000 */

The slow kernel ends a loop with the STG instruction inside the loop. The slow kernel also has many instances of the STG instruction throughout the kernel, presumably because of compiler unrolling.
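
If this explanation is correct, a variant that keeps the accumulation in a register but still writes the running value to global memory on every iteration should time much closer to sum_gpu_SLOW than to sum_gpu_FAST. A rough sketch of such a check, with a hypothetical kernel name and nothing measured here:

__global__
void sum_gpu_STORE_EACH_ITER(int (&data)[N][2], int& sum, int n) {
    int s = 0;                               // accumulate in a register...
    for (int i = 0; i < n; i++) {
        s += data[i][0] * 10 + data[i][1];
        sum = s;                             // ...but still store to global memory every iteration
    }
}

Launched the same way as the other two kernels (<<<1, 1>>>, passing *dev_v and *dev_sum), this would isolate the cost of the per-iteration global store from the cost of the loads and arithmetic.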
