
Why does increasing the number of blocks in CUDA increase the time?

My understanding is that in CUDA, increasing the number of blocks should not increase the runtime, because blocks are executed in parallel. But in my code, if I double the number of blocks, the time doubles as well.

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#define num_of_blocks 500
#define num_of_threads 512

__constant__ double y = 1.1;

// set seed for random number generator
__global__ void initcuRand(curandState* globalState, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &globalState[idx]);
}

// kernel function for SIR
__global__ void test(curandState* globalState, double *dev_data){
    // global threads id
    int idx     = threadIdx.x + blockIdx.x * blockDim.x;

    // local threads id
    int lidx    = threadIdx.x;

    // create shared memory to hold each thread's RNG state
    __shared__ curandState localState[num_of_threads];

    // shared memory to store samples
    __shared__ double sample[num_of_threads];

    // copy RNG state from global to shared memory
    localState[lidx]    = globalState[idx];
    __syncthreads();

    sample[lidx]    =  y + curand_normal_double(&localState[lidx]);

    if(lidx == 0){
        // save the first sample to dev_data;
        dev_data[blockIdx.x] = sample[0];
    }

    globalState[idx]    = localState[lidx];
}

int main(){
    // create random number generator states;
    curandState *globalState;
    cudaMalloc((void**)&globalState, num_of_blocks*num_of_threads*sizeof(curandState));
    initcuRand<<<num_of_blocks, num_of_threads>>>(globalState, 1);

    double *dev_data;
    cudaMalloc((void**)&dev_data, num_of_blocks*sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Start record
    cudaEventRecord(start, 0);

    test<<<num_of_blocks, num_of_threads>>>(globalState, dev_data);

    // Stop event
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    // Clean up:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    std::cout << "Time ellapsed: " << elapsedTime << std::endl;

    cudaFree(dev_data);
    cudaFree(globalState);
    return 0;
}

The test result is:

number of blocks: 500, Time elapsed: 0.39136.
number of blocks: 1000, Time elapsed: 0.618656.

So what is the reason that the time increases? Is it because I access constant memory, or because I copy the data from shared memory to global memory? Are there ways to optimise it?

While the number of blocks able to run in parallel can be large, it is still finite due to limited on-chip resources. If the number of blocks requested in a kernel launch exceeds that limit, any further blocks have to wait for earlier blocks to finish and free up their resources.

One limited resource is shared memory, of which your kernel uses 28 kilobytes. CUDA 8.0 compatible Nvidia GPUs offer between 48 and 112 kilobytes of shared memory per streaming multiprocessor (SM), so that the maximum number of blocks running at any one time is between 1× and 3× the number of SMs on your GPU.
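
The 28 kilobyte figure follows directly from the two __shared__ arrays in the kernel. As a rough sketch (assuming the default XORWOW curandState, which is 48 bytes in current CUDA toolkits), the per-block footprint can be checked from host code:

#include <curand_kernel.h>
#include <cstdio>

#define num_of_threads 512

int main(){
    // per-block shared memory used by the kernel:
    //   curandState localState[512] -> 512 * sizeof(curandState)  (512 * 48 = 24576 bytes, if the state is 48 bytes)
    //   double      sample[512]     -> 512 * sizeof(double)       (512 * 8  =  4096 bytes)
    size_t smem = num_of_threads * sizeof(curandState)
                + num_of_threads * sizeof(double);
    printf("shared memory per block: %zu bytes (%.1f KB)\n", smem, smem / 1024.0);
    return 0;
}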

Other limited resources are registers and various per-warp resources in the scheduler. The CUDA occupancy calculator is a convenient Excel spreadsheet (it also works with OpenOffice/LibreOffice) that shows you how these resources limit the number of blocks per SM for a specific kernel. Compile the kernel adding the option --ptxas-options="-v" to the nvcc command line, locate the line saying "ptxas info : Used XX registers, YY bytes smem, zz bytes cmem[0], ww bytes cmem[2]", and enter XX, YY, the number of threads per block you are trying to launch, and the compute capability of your GPU into the spreadsheet. It will then show the maximum number of blocks that can run in parallel on one SM.
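
If you prefer not to use the spreadsheet, the same limit can also be queried at runtime through the occupancy API. A minimal sketch, assuming it is added to main() of the posted program after the kernel definitions:

    // ask the runtime how many blocks of `test` (512 threads, no dynamic shared memory)
    // can be resident on one SM, then scale by the number of SMs on device 0
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, test, num_of_threads, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    std::cout << "blocks per SM: " << maxBlocksPerSM
              << ", concurrently resident blocks: " << maxBlocksPerSM * prop.multiProcessorCount
              << std::endl;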

You don't mention the GPU you have been running the test on, so I'll use a GTX 980 as an example. It has 16 SMs with 96 KB of shared memory each, so at most 16×3 = 48 blocks can run in parallel. Had you not used shared memory, the maximum number of resident warps would have limited the number of blocks per SM to 4 (512 threads per block are 16 warps, and at most 64 warps can be resident per SM), allowing 64 blocks to run in parallel.

On any currently existing Nvidia GPU, your example requires at least about a dozen waves of blocks executing sequentially, which explains why doubling the number of blocks also roughly doubles the runtime.
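
If you want more blocks to run concurrently, one option is to avoid storing the curandState array in shared memory and keep each thread's state in a local variable instead. This is only a sketch under the same file-level definitions (y, num_of_threads), and not necessarily the best layout for your full SIR kernel, but it cuts per-block shared memory from roughly 28 KB to roughly 4 KB, so that warp and register limits, rather than shared memory, determine occupancy:

__global__ void test_local_state(curandState* globalState, double *dev_data){
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    int lidx = threadIdx.x;

    // samples still go through shared memory, but the RNG state stays private to the thread
    __shared__ double sample[num_of_threads];

    curandState localState = globalState[idx];
    sample[lidx] = y + curand_normal_double(&localState);
    __syncthreads();

    if(lidx == 0){
        // save the first sample of the block to dev_data
        dev_data[blockIdx.x] = sample[0];
    }

    globalState[idx] = localState;
}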
