
Measure CUDA execution time with respect to number of blocks

I wrote my first CUDA script and wonder how it is parallelized. I have some variables r0_dev and f_dev, each an array of length (*fnum_dev) * 3. On each block, they are read sequentially. Then there are r_dev, which I read from, and v_dev, which I want to write to in parallel; both are arrays of length gnum * 3.
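
For context, a minimal sketch of how such buffers might be set up; the host-side names fnum, gnum, r0_host, f_host and r_host are assumptions for illustration, not part of the original code:

int *fnum_dev;
float *r0_dev, *f_dev, *r_dev, *v_dev;
cudaMalloc(&fnum_dev, sizeof(int));                      // number of force points
cudaMalloc(&r0_dev, fnum * 3 * sizeof(float));           // force positions
cudaMalloc(&f_dev,  fnum * 3 * sizeof(float));           // forces
cudaMalloc(&r_dev,  gnum * 3 * sizeof(float));           // grid point positions
cudaMalloc(&v_dev,  gnum * 3 * sizeof(float));           // velocities (output)
cudaMemcpy(fnum_dev, &fnum, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(r0_dev, r0_host, fnum * 3 * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(f_dev,  f_host,  fnum * 3 * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(r_dev,  r_host,  gnum * 3 * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(v_dev, 0, gnum * 3 * sizeof(float));          // clear the accumulator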

The program produces the results I want it to produce, but the complexity (execution time as a function of data size) is not what I would have expected.

My expectation is that, when the size of the array v_dev increases, the execution time stays constant, as long as gnum is smaller than the number of blocks allowed in some dimension.

Reality is different. The time was measured with the following code. A linear complexity is observed, which is what I would have expected from sequential code.

dim3 blockGrid(gnum);

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// the actual calculation
stokeslets<<<blockGrid, 1>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev);

// time measurement
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
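
cudaEventElapsedTime returns the elapsed time in milliseconds. One small addition, not in the original snippet: the timing events can be released once the measurement has been read.

// release the timing events after use
cudaEventDestroy(start);
cudaEventDestroy(stop);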

Question:

Is my expectation, described above, wrong? What additional considerations are important?

Details

The following shows the implementation of stokeslets. Maybe I'm doing something bad there?

__device__ void gridPoint(int offset, float* r0, float* f, float* r, float* v) {
    int flatInd = 3 * offset;

    float dr[3];
    float len = 0;
    float drf = 0;

    // accumulate dr = r - r0, |dr|^2 and the projection dr·f
    for (int i = 0; i < 3; i++) {
        dr[i] = r[i] - r0[i + flatInd];
        len += dr[i] * dr[i];
        drf += dr[i] * f[i + flatInd];
    }

    len = sqrt(len);
    // stokeslet (Oseen tensor) contribution: v += fak * (f / len + dr * (dr·f) / len^3)
    float fak = 1 / (8 * 3.1416 * 0.7);
    v[0] +=  (fak / len) * (f[0 + flatInd] + (dr[0]) * drf / (len * len));
    v[1] +=  (fak / len) * (f[1 + flatInd] + (dr[1]) * drf / (len * len));
    v[2] +=  (fak / len) * (f[2 + flatInd] + (dr[2]) * drf / (len * len));
}


__global__ void stokeslets(int* fnum, float* r0, float* f, float* r, float* v) {
    // where are we (which block, which is equivalent to the grid point)?
    int idx = blockIdx.x;

    // we want to add all force contributions
    float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};

    float vh[3] = {0, 0, 0};
    for (int i=0; i < *fnum; i++) {
        gridPoint(i, r0, f, rh, vh);
    }
    // add the accumulated velocity vh to the global array
    int flatInd = 3 * idx;
    v[0 + flatInd] += vh[0];
    v[1 + flatInd] += vh[1];
    v[2 + flatInd] += vh[2];
}

The main problem with your code is that you are running multiple blocks containing only one thread.

Quoting the CUDA C Programming Guide:

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently.

Quoting the answer to the post How CUDA Blocks/Warps/Threads map onto CUDA Cores?:

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units.

From the two quoted passages it follows that the warp schedulers dispatch at most 2 warps per clock cycle per streaming multiprocessor. Since each of your blocks contains only a single thread, every warp has just one active lane, so effectively only 2 threads run per clock cycle per streaming multiprocessor. This is the main reason why you are observing essentially the same computational complexity as in the sequential case.

The recommendation is to rewrite your code/kernel so that it can run multiple threads per block.
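
A minimal sketch of such a refactoring, assuming a block size of 256 and an extra gnum parameter (not in the original signature) to guard against out-of-range threads:

__global__ void stokeslets(int* fnum, float* r0, float* f, float* r, float* v, int gnum) {
    // one thread per grid point instead of one block per grid point
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= gnum) return;   // the last block may be partially filled

    float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};

    float vh[3] = {0, 0, 0};
    for (int i = 0; i < *fnum; i++) {
        gridPoint(i, r0, f, rh, vh);
    }

    int flatInd = 3 * idx;
    v[0 + flatInd] += vh[0];
    v[1 + flatInd] += vh[1];
    v[2 + flatInd] += vh[2];
}

with a matching launch configuration:

// enough blocks of 256 threads to cover all gnum grid points
int threadsPerBlock = 256;
int blocks = (gnum + threadsPerBlock - 1) / threadsPerBlock;
stokeslets<<<blocks, threadsPerBlock>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev, gnum);

This way each warp contains 32 active threads, and the streaming multiprocessors can hide memory latency by switching between warps.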

Further reading: the Fermi architecture whitepaper.
