
Concurrency of one large kernel with many small kernels and memcopys (CUDA)

I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding: while data is exchanged, the GPU computes the part of the mesh that is not involved in communication, and computes the rest of the mesh once communication is done.

I am trying to solve this by having one stream ( computeStream ) for the long-running kernel ( fluxKernel ) and one ( communicationStream ) for the different phases of communication. The computeStream has a very low priority, in order to allow kernels on the communicationStream to interleave with the fluxKernel, even though it uses all resources.

These are the streams I am using:

int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low , &priority_high ) ;
cudaStreamCreateWithPriority (&communicationStream, cudaStreamNonBlocking, priority_high );
cudaStreamCreateWithPriority (&computeStream      , cudaStreamNonBlocking, priority_low  );

The desired concurrency pattern looks like this:

[Image: desired concurrency pattern]

I need to synchronize the communicationStream before I send the data via MPI, to ensure that the data is completely downloaded before I send it on.
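
(As an aside, a sketch only: the host-side wait could equivalently be tied to the download by recording a CUDA event after the cudaMemcpyAsync and synchronizing on that event; sendDone below is a hypothetical name and not part of my actual code.)

// Sketch: wait for the D2H copy via an event instead of cudaStreamSynchronize
cudaEvent_t sendDone;                                      // hypothetical event name
cudaEventCreateWithFlags( &sendDone, cudaEventDisableTiming );

cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaEventRecord      ( sendDone, communicationStream );    // marks completion of the download
cudaEventSynchronize ( sendDone );                         // host waits for the work recorded so far
MPI_Isend( ... );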

In the following listing I show the structure of what I am currently doing. First I start the long-running fluxKernel for the main part of the mesh on the computeStream. Then I start a sendKernel that collects the data that should be sent to the second GPU and subsequently download it to the host (I cannot use CUDA-aware MPI due to hardware limitations). The data is then sent non-blocking with MPI_Isend, and a blocking receive ( MPI_Recv ) is used subsequently. When the data is received, the procedure is done backwards: first the data is uploaded to the device and then spread into the main data structure by recvKernel. Finally the fluxKernel is called for the remaining part of the mesh on the communicationStream.

Note that before and after the shown code, kernels are run on the default stream.

{ ... } // Preparations

// Start main part of computatation on first stream

fluxKernel<<< ..., ..., 0, computeStream >>>( /* main Part */ );

// Prepare send data

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );

cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );

// MPI Communication

MPI_Isend( ... );
MPI_Recv ( ... );

// Use received data

cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );

recvKernel<<< ..., ..., 0, communicationStream >>>( ... );

fluxKernel<<< ..., ..., 0, communicationStream >>>( /* remaining Part */ );

{ ... } // Rest of the Computations

I used nvprof and the Visual Profiler to see whether the streams actually execute concurrently. This is the result:

[Image: profiler timeline with one communication]

I observe that the sendKernel (purple), upload, MPI communication and download run concurrently with the fluxKernel. The recvKernel (red) only starts after the other stream is finished, though. Turning off the synchronization does not solve the problem:

[Image: profiler timeline without synchronization]

For my real application I have not only one communication, but multiple. I tested this with two communications as well. The procedure is:

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );

MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );

MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );

The result is similar to the one with one communication (above), in the sense that the second kernel invocation (this time it is a sendKernel) is delayed until the kernel on the computeStream is finished.

[Image: profiler timeline with two communications]

Hence the overall observation is that the second kernel invocation is delayed, independent of which kernel it is.

Can you explain why the GPU is synchronizing in this way, or how I can get the second kernel on the communicationStream to also run concurrently with the computeStream?

Thank you very much.

Edit 1: complete rework of the question


Minimal Reproducible Example

I built a minimal reproducible example. In the end the code prints the int data to the terminal. The correct last value would be 32778 (= (32*1024 - 1) + 1 + 10). At the beginning I added an option integer to test 3 different options:

  • 0: Intended version with synchronization before the CPU modification of the data
  • 1: Same as 0, but without synchronization
  • 2: Dedicated stream for the memcopies and no synchronization

#include <iostream>

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

const int option = 0;

const int numberOfEntities = 2 * 1024 * 1024;
const int smallNumberOfEntities = 32 * 1024;

__global__ void longKernel(float* dataDeviceIn, float* dataDeviceOut, int numberOfEntities)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;

    float tmp = dataDeviceIn[index];

#pragma unroll
    for( int i = 0; i < 2000; i++ ) tmp += 1.0;

    dataDeviceOut[index] = tmp;
}

__global__ void smallKernel_1( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;

    smallDeviceData[index] = index;
}

__global__ void smallKernel_2( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;

    int value = smallDeviceData[index];

    value += 10;

    smallDeviceData[index] = value;
}


int main(int argc, char **argv)
{
    cudaSetDevice(0);

    float* dataDeviceIn;
    float* dataDeviceOut;

    cudaMalloc( &dataDeviceIn , sizeof(float) * numberOfEntities );
    cudaMalloc( &dataDeviceOut, sizeof(float) * numberOfEntities );

    int* smallDataDevice;
    int* smallDataHost;

    cudaMalloc    ( &smallDataDevice, sizeof(int) * smallNumberOfEntities );
    cudaMallocHost( &smallDataHost  , sizeof(int) * smallNumberOfEntities );

    cudaStream_t streamLong;
    cudaStream_t streamSmall;
    cudaStream_t streamCopy;

    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low , &priority_high ) ;
    cudaStreamCreateWithPriority (&streamLong , cudaStreamNonBlocking, priority_low  );
    cudaStreamCreateWithPriority (&streamSmall, cudaStreamNonBlocking, priority_high );
    cudaStreamCreateWithPriority (&streamCopy , cudaStreamNonBlocking, priority_high );

    //////////////////////////////////////////////////////////////////////////

    longKernel <<< numberOfEntities / 32, 32, 0, streamLong >>> (dataDeviceIn, dataDeviceOut, numberOfEntities);

    //////////////////////////////////////////////////////////////////////////

    smallKernel_1 <<< smallNumberOfEntities / 32, 32, 0 , streamSmall >>> (smallDataDevice, smallNumberOfEntities);

    if( option <= 1 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamCopy  );

    if( option == 0 ) cudaStreamSynchronize( streamSmall );

    // some CPU modification of data
    for( int i = 0; i < smallNumberOfEntities; i++ ) smallDataHost[i] += 1;

    if( option <= 1 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamCopy  );

    smallKernel_2 <<< smallNumberOfEntities / 32, 32, 0 , streamSmall >>> (smallDataDevice, smallNumberOfEntities);

    //////////////////////////////////////////////////////////////////////////

    cudaDeviceSynchronize();

    cudaMemcpy( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost );

    for( int i = 0; i < smallNumberOfEntities; i++ ) std::cout << smallDataHost[i] << "\n";

    return 0;
}

With this code I see the same behavior as described above:

Option 0 (correct result): [Image: terminal output]

Option 1 (wrong result, the +1 from the CPU is missing): [Image: terminal output]

Option 2 (completely wrong result, all values are 10, the download happens before smallKernel_1): [Image: terminal output]
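
(Sketch only, for reference: with a dedicated copy stream as in option 2, the ordering between smallKernel_1 on streamSmall and the download on streamCopy would have to be expressed explicitly, e.g. with an event; kernelDone is a hypothetical name and this is not part of the tested code.)

// Sketch: make streamCopy wait for smallKernel_1 issued on streamSmall
cudaEvent_t kernelDone;                                    // hypothetical event name
cudaEventCreateWithFlags( &kernelDone, cudaEventDisableTiming );

smallKernel_1 <<< smallNumberOfEntities / 32, 32, 0, streamSmall >>> (smallDataDevice, smallNumberOfEntities);
cudaEventRecord( kernelDone, streamSmall );                // marks completion of smallKernel_1
cudaStreamWaitEvent( streamCopy, kernelDone, 0 );          // streamCopy waits before starting the copy
cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamCopy );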


Solutions:

Running option 0 under Linux (following the suggestion in Robert's answer below) produces the expected behavior! [Image: profiler timeline under Linux]

Here's how I would try to accomplish this.

  1. Use a high-priority/low-priority stream arrangement as you suggest.
  2. Only 2 streams should be needed.
  3. Make sure to pin host memory to allow compute/copy overlap.
  4. Since you don't intend to use CUDA-aware MPI, your MPI transactions are purely host activity. Therefore we can use a stream callback to insert this host activity into the high-priority stream.
  5. To allow the high-priority kernels to easily insert themselves into the low-priority kernels, I choose a design strategy of grid-stride-loop for the high-priority copy kernels, but non-grid-stride-loop for the low-priority kernels. We want the low-priority kernels to have a larger number of blocks, so that blocks are launching and retiring all the time, easily allowing the GPU block scheduler to insert high-priority blocks as they become available.
  6. The work issuance per "frame" uses no synchronize calls of any kind. I am using one cudaDeviceSynchronize() per loop/frame, to break (separate) the processing of one frame from the next. The arrangement of activities within a frame is handled entirely with CUDA stream semantics, to enforce serialization for activities that depend on each other, but to allow concurrency for activities that don't.

Here's a sample code that implements these ideas:

#include <iostream>
#include <unistd.h>
#include <cstdio>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

typedef double mt;
const int nTPB = 512;
const size_t ds = 100ULL*1048576;
const size_t bs = 1048576ULL;
const int  my_intensity = 1;
const int loops = 4;
const size_t host_func_delay_us = 100;
const int max_blocks = 320; // chosen based on GPU, could use runtime calls to set this via cudaGetDeviceProperties

template <typename T>
__global__ void fluxKernel(T * __restrict__ d, const size_t n, const int intensity){

  size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x;
  if (idx < n){
    T temp = d[idx];
    for (int i = 0; i < intensity; i++)
      temp = sin(temp);  // just some dummy code to simulate "real work"
    d[idx] = temp;
    }
}

template <typename T>
__global__ void sendKernel(const T * __restrict__ d, const size_t n, T * __restrict__ b){

  for (size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x; idx < n; idx += ((size_t)blockDim.x)*gridDim.x)
    b[idx] = d[idx];
}

template <typename T>
__global__ void recvKernel(const T * __restrict__ b, const size_t n, T * __restrict__ d){

  for (size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x; idx < n; idx += ((size_t)blockDim.x)*gridDim.x)
    d[idx] = b[idx];
}

void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data){
    printf("Loop %lu callback\n", (size_t)data);
    usleep(host_func_delay_us); // simulate: this is where non-cuda-aware MPI calls would go, operating on h_buf
}
int main(){

  // get the range of stream priorities for this device
  int priority_high, priority_low;
  cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
  // create streams with highest and lowest available priorities
  cudaStream_t st_high, st_low;
  cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
  cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);
  // allocations
  mt *h_buf, *d_buf, *d_data;
  cudaMalloc(&d_data, ds*sizeof(d_data[0]));
  cudaMalloc(&d_buf, bs*sizeof(d_buf[0]));
  cudaHostAlloc(&h_buf, bs*sizeof(h_buf[0]), cudaHostAllocDefault);
  cudaCheckErrors("setup error");
  // main processing loop
  for (unsigned long i = 0; i < loops; i++){
    // issue low-priority
    fluxKernel<<<((ds-bs)+nTPB)/nTPB, nTPB,0,st_low>>>(d_data+bs, ds-bs, my_intensity);
    // issue high-priority
    sendKernel<<<max_blocks,nTPB,0,st_high>>>(d_data, bs, d_buf);
    cudaMemcpyAsync(h_buf, d_buf, bs*sizeof(h_buf[0]), cudaMemcpyDeviceToHost, st_high);
    cudaStreamAddCallback(st_high, MyCallback, (void*)i, 0);
    cudaMemcpyAsync(d_buf, h_buf, bs*sizeof(h_buf[0]), cudaMemcpyHostToDevice, st_high);
    recvKernel<<<max_blocks,nTPB,0,st_high>>>(d_buf, bs, d_data);
    fluxKernel<<<((bs)+nTPB)/nTPB, nTPB,0,st_high>>>(d_data, bs, my_intensity);
    cudaDeviceSynchronize();
    cudaCheckErrors("loop error");
    }
  return 0;
}

Here is the Visual Profiler timeline output (on Linux, Tesla V100):

[Image: Visual Profiler timeline]

Note that arranging complex concurrency scenarios can be quite challenging on Windows WDDM. I would recommend avoiding that, and this answer does not intend to discuss all the challenges there. I suggest using Linux or Windows TCC GPUs to do this.
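
(If it's unclear which driver model a Windows GPU is running, one way to check, sketched here and not part of the answer code above, is to read the tccDriver field of cudaDeviceProp:)

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    // Sketch: report whether device 0 uses the TCC driver (Windows) or WDDM/other
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s: tccDriver = %d (1 = TCC, 0 = WDDM or non-Windows)\n", prop.name, prop.tccDriver);
    return 0;
}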

If you try this code on your machine, you may need to adjust some of the various constants to get things to look like this.
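
(One possible variation, sketched under the assumption of CUDA 10 or newer and not part of the sample above: the host callback could also be issued with cudaLaunchHostFunc, which takes a plain void(void*) host function; MyHostFunc is a hypothetical name mirroring MyCallback.)

// Sketch: cudaLaunchHostFunc variant of the callback (requires CUDA >= 10)
void CUDART_CB MyHostFunc(void *data){
    printf("Loop %lu host func\n", (size_t)data);
    usleep(host_func_delay_us); // non-cuda-aware MPI calls operating on h_buf would go here
}

// inside the per-frame loop, instead of cudaStreamAddCallback:
//   cudaLaunchHostFunc(st_high, MyHostFunc, (void*)i);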
