
How can I make sure two kernels in two streams are sent to the GPU at the same time to run?

I am a beginner in CUDA. I am using an NVIDIA GeForce GTX 1070, CUDA toolkit 11.3, and Ubuntu 18.04. As shown in the code below, I use two CPU threads to send two kernels, in the form of two streams, to one GPU. I want these two kernels to be sent to the GPU at exactly the same time. Is there a way to do this?

Or at least a better way than what I did.

Thank you in advance.

My code:

//Headers
pthread_cond_t cond;
pthread_mutex_t cond_mutex;
unsigned int waiting;
cudaStream_t streamZero, streamOne;  

//Kernel zero defined here
__global__ void kernelZero(){...}

//Kernel one defined here
__global__ void kernelOne(){...}

//This function is defined to synchronize two threads when sending kernels to the GPU.
void threadsSynchronize(void) {
    pthread_mutex_lock(&cond_mutex);
    if (++waiting == 2) {
        pthread_cond_broadcast(&cond);
    } else {
        while (waiting != 2)
            pthread_cond_wait(&cond, &cond_mutex);
    }
    pthread_mutex_unlock(&cond_mutex);
}


void *threadZero(void *_) {
    // ...
    threadsSynchronize();
    kernelZero<<<blocksPerGridZero, threadsPerBlockZero, 0, streamZero>>>();
    cudaStreamSynchronize(streamZero);
    // ...
    return NULL;
}


void *threadOne(void *_) {
    // ...
    threadsSynchronize();
    kernelOne<<<blocksPerGridOne, threadsPerBlockOne, 0, streamOne>>>();
    cudaStreamSynchronize(streamOne);
    // ...
    return NULL;
}


int main(void) {
    pthread_t zero, one;
    cudaStreamCreate(&streamZero);
    cudaStreamCreate(&streamOne); 
    // ...
    pthread_create(&zero, NULL, threadZero, NULL);
    pthread_create(&one, NULL, threadOne, NULL);
    // ...
    pthread_join(zero, NULL);
    pthread_join(one, NULL);
    cudaStreamDestroy(streamZero);  
    cudaStreamDestroy(streamOne);  
    return 0;
}

Actually witnessing concurrent kernel behavior on a GPU has a number of requirements, which are covered in other questions here under the SO cuda tag, so I'm not going to cover that ground.

Let's assume your kernels have the possibility to run concurrently.
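Whether concurrent execution is possible at all can be checked programmatically via the device properties (a minimal sketch, querying device 0; a GTX 1070 should report 1):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    // concurrentKernels is 1 when the device can execute kernels
    // from different streams at the same time.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}
```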

In that case, you're not going to do any better than this, whether you use threading or not:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
kernel1<<<...,s1>>>(...);
kernel2<<<...,s2>>>(...);

If your kernels have a "long" duration (much longer than the kernel launch overhead, roughly 5-50 us), then they will appear to start at "nearly" the same time. You won't do better than this by switching to threading. The reason for this is not published as far as I know, so I will simply say that my own observations suggest to me that kernel launches to the same GPU are serialized, somehow, by the CUDA runtime. You can find anecdotal evidence of this on various forums, and it's fine if you don't believe me. There's also no reason to assume, with the CPU threading mechanisms I am familiar with, that CPU threads execute in lockstep. Therefore there is no reason to assume that a threading system will cause the kernel launches in two different threads to even be reached by the host threads at the same instant in time.
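If you want to see how close the starts actually are, you can record an event in each stream just before its launch and compare them (a sketch with a hypothetical busy-wait kernel; the elapsed time between the two events approximates the gap between the launches):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }  // busy-wait: a "long" kernel
}

int main() {
    cudaStream_t s1, s2;
    cudaEvent_t e1, e2;
    cudaStreamCreate(&s1);  cudaStreamCreate(&s2);
    cudaEventCreate(&e1);   cudaEventCreate(&e2);
    // Record an event in each stream immediately before its kernel launch.
    cudaEventRecord(e1, s1);
    spin<<<1, 1, 0, s1>>>(100000000LL);
    cudaEventRecord(e2, s2);
    spin<<<1, 1, 0, s2>>>(100000000LL);
    cudaDeviceSynchronize();
    float gap_ms = 0.0f;
    cudaEventElapsedTime(&gap_ms, e1, e2);
    printf("inter-launch gap: %.3f ms\n", gap_ms);
    return 0;
}
```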

You might do a small amount better by using cudaLaunchKernel for the kernel launch, rather than the triple-chevron launch syntax <<<...>>>, but there really is no documentation to support this claim. YMMV.
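For reference, a triple-chevron launch maps onto cudaLaunchKernel like this (a sketch; `myKernel`, `grid`, `block`, and `stream` are hypothetical placeholders):

```cuda
__global__ void myKernel(int n) { /* ... */ }

// Equivalent of: myKernel<<<grid, block, 0, stream>>>(n);
int n = 42;
void *args[] = { &n };  // one pointer per kernel parameter, in order
cudaLaunchKernel((const void *)myKernel, grid, block, args,
                 /*sharedMem=*/0, stream);
```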

Keep in mind that the GPU does its best work as a throughput processor. There are no explicit mechanisms to ensure simultaneous kernel launch, and it's unclear why you would need that.
