How do CUDA kernels run across multiple blocks when each block takes a different amount of time?
Assume that we run a kernel function with 4 blocks {b1, b2, b3, b4}. The blocks require {10, 2, 3, 4} units of time to complete their work, and our GPU can process only 2 blocks in parallel.
If so, which scheduling correctly describes how our GPU works?
To quote this document from Nvidia:
Threadblocks are assigned to SMs
- Assignment happens only if an SM has sufficient resources for the entire threadblock
- Resources: registers, SMEM, warp slots
- Threadblocks that haven't been assigned wait for resources to free up
- The order in which threadblocks are assigned is not defined
- Can and does vary between architectures
Thus, without more information, both schedulings are theoretically possible. In practice, this is even more complex, since a GPU has many SMs and, AFAIK, each SM can nowadays execute multiple blocks concurrently.
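To see why the assignment order matters, here is a toy Python simulation of a greedy scheduler (an assumption for illustration, not the actual hardware behavior): blocks are launched in a given order, and each block starts as soon as one of the 2 concurrent slots frees up.

```python
import heapq

def simulate(durations, order, slots=2):
    """Toy model of greedy block scheduling: launch blocks in `order`,
    starting each one as soon as one of `slots` slots is free.
    Returns a list of (block, finish_time) in completion order."""
    queue = list(order)
    running = []   # min-heap of (finish_time, block)
    finished = []
    t = 0
    while queue or running:
        # Fill free slots with the next waiting blocks.
        while queue and len(running) < slots:
            b = queue.pop(0)
            heapq.heappush(running, (t + durations[b], b))
        # Advance time to the next block completion.
        t, b = heapq.heappop(running)
        finished.append((b, t))
    return finished

durations = {"b1": 10, "b2": 2, "b3": 3, "b4": 4}

# Launch order b1..b4: total time is 10 (b1 hides the short blocks).
print(simulate(durations, ["b1", "b2", "b3", "b4"]))
# → [('b2', 2), ('b3', 5), ('b4', 9), ('b1', 10)]

# Reverse launch order: b1 starts last, so the total time grows to 14.
print(simulate(durations, ["b4", "b3", "b2", "b1"]))
# → [('b3', 3), ('b4', 4), ('b2', 5), ('b1', 14)]
```

Since the real assignment order is undefined and architecture-dependent, any such ordering is a valid outcome; the model only shows how the order changes completion times.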