
How does a CUDA kernel work on multiple blocks, each of which has a different time consumption?

Assume that we run a kernel function with 4 blocks {b1, b2, b3, b4}. Each block requires {10, 2, 3, 4} units of time to complete its job, and our GPU can process only 2 blocks in parallel.

If so, which one is the correct description of how our GPU works?

[image: the two candidate block-scheduling timelines]

To quote this document from Nvidia:

Threadblocks are assigned to SMs

  • Assignment happens only if an SM has sufficient resources for the entire threadblock
    • Resources: registers, SMEM, warp slots
    • Threadblocks that haven't been assigned wait for resources to free up
  • The order in which threadblocks are assigned is not defined
    • Can and does vary between architectures
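The resource constraint described in the quote can be probed programmatically with the occupancy API. A minimal sketch (the kernel `busy` and the block size of 128 are arbitrary choices for illustration; a CUDA toolkit and device are assumed):

```cuda
#include <cstdio>

// Trivial kernel; its register/shared-memory footprint is what
// limits how many copies of a block one SM can host at once.
__global__ void busy(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 2.0f;
}

int main() {
    int blocksPerSM = 0;
    // How many 128-thread blocks of this kernel fit on one SM,
    // given its registers, SMEM, and warp-slot usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, busy, 128, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("blocks per SM: %d, SMs: %d, max concurrent blocks: %d\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```

The last number printed is the upper bound on how many blocks run truly in parallel; any remaining blocks wait, as the quote says, for resources to free up.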

Thus, without more information, both schedulings are theoretically possible. In practice, this is even more complex, since there are many SMs on a GPU and, AFAIK, each SM can now execute multiple blocks concurrently.
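That the assignment order is undefined can be observed empirically. A small experiment sketch (the kernel and counter scheme are hypothetical, not from the quoted document): each block atomically grabs a rank when its first thread starts, so the printed ranks reveal the actual start order, which may differ from the block indices and between runs or architectures.

```cuda
#include <cstdio>

__global__ void record_order(int *order, int *counter) {
    if (threadIdx.x == 0) {
        // atomicAdd returns the previous value: this block's start rank.
        order[blockIdx.x] = atomicAdd(counter, 1);
    }
}

int main() {
    const int nBlocks = 4;
    int *order, *counter;
    cudaMallocManaged(&order, nBlocks * sizeof(int));
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    record_order<<<nBlocks, 32>>>(order, counter);
    cudaDeviceSynchronize();

    for (int i = 0; i < nBlocks; ++i)
        printf("block %d started with rank %d\n", i, order[i]);

    cudaFree(order);
    cudaFree(counter);
    return 0;
}
```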
