How do CUDA kernels run across multiple blocks when each block takes a different amount of time?
Assume that we run a kernel function with 4 blocks {b1, b2, b3, b4}. The blocks require {10, 2, 3, 4} units of time to complete their work, and our GPU can process only 2 blocks in parallel.
If so, which scheduling correctly describes how our GPU works?
To quote this document from Nvidia:
Threadblocks are assigned to SMs
- Assignment happens only if an SM has sufficient resources for the entire threadblock
- Resources: registers, SMEM, warp slots
- Threadblocks that haven't been assigned wait for resources to free up
- The order in which threadblocks are assigned is not defined
- Can and does vary between architectures
Thus, without more information, both schedulings are theoretically possible. In practice, this is even more complex, since a GPU has many SMs and, AFAIK, each SM can nowadays execute multiple blocks concurrently.
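To see why the assignment order matters, here is a toy Python simulation of a greedy scheduler (an assumption for illustration, not the actual hardware behavior): blocks are launched in a given order, and each block starts as soon as one of the 2 concurrent slots frees up.

```python
import heapq

def simulate(durations, order, slots=2):
    """Toy model of greedy block scheduling: launch blocks in `order`,
    starting each one as soon as one of `slots` slots is free.
    Returns a list of (block, finish_time) in completion order."""
    queue = list(order)
    running = []   # min-heap of (finish_time, block)
    finished = []
    t = 0
    while queue or running:
        # Fill free slots with the next waiting blocks.
        while queue and len(running) < slots:
            b = queue.pop(0)
            heapq.heappush(running, (t + durations[b], b))
        # Advance time to the next block completion.
        t, b = heapq.heappop(running)
        finished.append((b, t))
    return finished

durations = {"b1": 10, "b2": 2, "b3": 3, "b4": 4}

# Launch order b1..b4: total time is 10 (b1 hides the short blocks).
print(simulate(durations, ["b1", "b2", "b3", "b4"]))
# → [('b2', 2), ('b3', 5), ('b4', 9), ('b1', 10)]

# Reverse launch order: b1 starts last, so the total time grows to 14.
print(simulate(durations, ["b4", "b3", "b2", "b1"]))
# → [('b3', 3), ('b4', 4), ('b2', 5), ('b1', 14)]
```

Since the real assignment order is undefined and architecture-dependent, any such ordering is a valid outcome; the model only shows how the order changes completion times.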