Why are there two warp schedulers in an SM of a GPU?
I read the NVIDIA Fermi whitepaper and got confused when I counted the SP cores and schedulers.
According to the whitepaper, each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. There are 32 SP cores in an SM; each core has a fully pipelined ALU and FPU and is used to execute an instruction for one thread.
As we all know, a warp is made up of 32 threads. If we issued just one warp each cycle, all threads in that warp would occupy all 32 SP cores and finish execution in one cycle (assuming there are no stalls).
However, NVIDIA devised a dual scheduler, which selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.
NVIDIA says this design leads to peak hardware performance. Maybe the peak performance comes from interleaving the execution of different instructions, taking full advantage of the hardware resources.
My questions are as follows (assume no memory stalls and all operands are available):
Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups, one for each warp scheduler?
Are the LD/ST and SFU units shared by all warps (that is, do they look uniform to warps from both schedulers)?
If a warp is divided into two parts, which part is scheduled first? Is there a scheduler for this, or is one part just selected at random?
What is the advantage of this design? Is it just to maximize hardware utilization?
Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups, one for each warp scheduler?
Yes. Fermi, unlike later generations, has a "hotclock" (shader clock) which runs at 2x the "core" clock. Each single-precision floating-point instruction (for example) issues over 2 "hotclocks", but to the same group of 16 SP cores. The net effect is one issue per "core" clock per scheduler.
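The arithmetic behind that answer can be sketched in a few lines of Python. This is only an illustrative model, not a description of the actual hardware pipeline: the function name `issue_time` is made up for this example, and it assumes each unit in a group accepts one thread per hotclock. The group sizes (sixteen cores, sixteen load/store units, four SFUs) are the ones quoted in the question.

```python
import math

WARP_SIZE = 32                 # threads per warp
HOTCLOCKS_PER_CORE_CLOCK = 2   # Fermi's shader clock runs at 2x the core clock


def issue_time(lanes):
    """Return (hotclocks, core_clocks) needed to issue one warp instruction
    to a group of `lanes` units.

    Assumption (for illustration only): each unit accepts one thread per hotclock.
    """
    hotclocks = math.ceil(WARP_SIZE / lanes)
    return hotclocks, hotclocks / HOTCLOCKS_PER_CORE_CLOCK


# Unit groups as listed in the question
for name, lanes in [("16 SP cores", 16), ("16 LD/ST units", 16), ("4 SFUs", 4)]:
    hot, core = issue_time(lanes)
    print(f"{name}: {hot} hotclocks = {core} core clocks per warp instruction")
```

Under these assumptions a warp instruction sent to a 16-core group takes 2 hotclocks, which is exactly one core clock, matching the "one issue per core clock per scheduler" statement above.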
Are the LD/ST and SFU units shared by all warps (that is, do they look uniform to warps from both schedulers)?
I don't really understand the question. All execution resources are shared and available to instructions coming from either scheduler.
If a warp is divided into two parts, which part is scheduled first? Is there a scheduler for this, or is one part just selected at random?
Why does this matter? The machine behaves as if two complete warp instructions are scheduled in one core clock, i.e. "dual issue". You don't have visibility into anything happening at the hotclock level anyway.
What is the advantage of this design? Is it just to maximize hardware utilization?
Yes, as stated in the Fermi whitepaper:
"Using this elegant model of dual-issue, Fermi achieves near peak hardware performance."
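To see why dual issue matters for utilization, here is a rough back-of-the-envelope model in Python. It is an idealized sketch under stated assumptions, not a simulation of real Fermi behavior: it assumes every scheduler issues one SP-core warp instruction per core clock to its own 16-core group, with no stalls, and the function name `sp_utilization` is hypothetical.

```python
WARP_SIZE = 32                 # threads per warp
GROUP_LANES = 16               # SP cores in one issue group
CORE_GROUPS = 2                # 32 SP cores = 2 groups of 16
HOTCLOCKS_PER_CORE_CLOCK = 2   # shader clock runs at 2x the core clock


def sp_utilization(num_schedulers):
    """Fraction of SP-core lane slots used per core clock.

    Idealized assumption: each scheduler issues one warp instruction per
    core clock to its own 16-core group, and nothing ever stalls.
    """
    # Lane slots available in one core clock: 16 lanes * 2 groups * 2 hotclocks
    lane_slots = GROUP_LANES * CORE_GROUPS * HOTCLOCKS_PER_CORE_CLOCK
    # A scheduler can keep at most one group busy, so extra schedulers add nothing
    busy_groups = min(num_schedulers, CORE_GROUPS)
    used_slots = busy_groups * WARP_SIZE
    return used_slots / lane_slots


for n in (1, 2):
    print(f"{n} scheduler(s): {sp_utilization(n):.0%} of SP lane slots used")
```

In this simplified model a single scheduler could keep only one of the two 16-core groups busy, while two schedulers fill both, which is consistent with the whitepaper's "near peak" claim for dual issue.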