
How are x86 uops scheduled, exactly?

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops 1 ) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions.

For example, take the following toy loop 2 :

top:
lea eax, [ecx + 5]
popcnt eax, eax
add edi, eax
dec ecx
jnz top

this basically implements the loop below (with the following correspondence: edi -> total, ecx -> c):

do {
  total += popcnt(c + 5);
} while (--c > 0);

I'm familiar with the process of optimizing any small loop by looking at the uop breakdown, dependency chain latencies and so on. In the loop above we have only one loop-carried dependency chain: dec ecx. The first three instructions of the loop (lea, popcnt, add) are part of a dependency chain that starts fresh each iteration.

The final dec and jnz are fused. So we have a total of 4 fused-domain uops, and only one loop-carried dependency chain, with a latency of 1 cycle. So based on those criteria, it seems that the loop can execute at 1 cycle/iteration.

However, we should look at the port pressure too:

  • The lea can execute on ports 1 and 5
  • The popcnt can execute on port 1
  • The add can execute on ports 0, 1, 5 and 6
  • The predicted-taken jnz executes on port 6

So to get to 1 cycle/iteration, you pretty much need the following to happen:

  • The popcnt must execute on port 1 (the only port it can execute on)
  • The lea must execute on port 5 (and never on port 1)
  • The add must execute on port 0, and never on any of the other three ports it can execute on
  • The jnz can only execute on port 6 anyway

That's a lot of conditions! If instructions just got scheduled randomly, you could get a much worse throughput. For example, 75% of the time the add would go to port 1, 5 or 6, which would delay the popcnt, lea or jnz by one cycle. Similarly for the lea, which can go to 2 ports, one shared with popcnt.
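To put a rough number on "much worse", here's a quick Monte-Carlo sketch. The port table comes from the bullets above; the uniform-random policy is a hypothetical strawman, not how any real scheduler behaves:

```python
import random

# Legal ports for each uop of the toy loop (from the table above).
PORTS = {
    "lea":    (1, 5),
    "popcnt": (1,),
    "add":    (0, 1, 5, 6),
    "jnz":    (6,),
}

def random_schedule(iters=100_000, seed=0):
    """Assign each uop a uniformly random legal port and return the
    average load on the busiest port - a lower bound on cycles per
    iteration, since each port executes at most one uop per cycle."""
    rng = random.Random(seed)
    load = {0: 0, 1: 0, 5: 0, 6: 0}
    for _ in range(iters):
        for ports in PORTS.values():
            load[rng.choice(ports)] += 1
    return max(load.values()) / iters

print(random_schedule())  # port 1 ends up with ~1.75 uops/iteration
```

So a truly random scheduler would be limited to roughly 1.75 cycles/iteration by port 1 alone (1 popcnt + 0.5 lea + 0.25 add on average), far from the ideal 1.0.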

IACA, on the other hand, reports a result very close to optimal, 1.05 cycles per iteration:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - l.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.05 Cycles       Throughput Bottleneck: FrontEnd, Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  | 0.9  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea eax, ptr [ecx+0x5]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | popcnt eax, eax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add edi, eax
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | dec ecx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff4

It pretty much reflects the necessary "ideal" scheduling I mentioned above, with a small deviation: it shows the add stealing port 5 from the lea on 1 out of 10 cycles. It also doesn't know that the fused branch is going to go to port 6 since it is predicted taken, so it puts most of the uops for the branch on port 0 and most of the uops for the add on port 6, rather than the other way around.

It's not clear if the extra 0.05 cycles that IACA reports over the optimal is the result of some deep, accurate analysis, or a less insightful consequence of the algorithm it uses, e.g., analyzing the loop over a fixed number of cycles, or just a bug or whatever. The same goes for the 0.1 fraction of a uop that it thinks will go to the non-ideal port. It is also not clear if one explains the other - I would think that mis-assigning a port 1 out of 10 times would cause a cycle count of 11/10 = 1.1 cycles per iteration, but I haven't worked out the actual downstream results - maybe the impact is less on average. Or it could just be rounding (0.05 == 0.1 to 1 decimal place).

So how do modern x86 CPUs actually schedule? In particular:

  1. When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
  2. When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
  3. If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Results on Skylake

Let's measure some actual results on Skylake to check which answers explain the experimental evidence, so here are some real-world measured results (from perf) on my Skylake box. Confusingly, I'm going to switch to using imul for my "only executes on one port" instruction, since it has many variants, including 3-argument versions that allow you to use different registers for the source(s) and destination. This is very handy when trying to construct dependency chains. It also avoids the whole "false dependency on the destination" that popcnt has.

Independent Instructions

Let's start by looking at the simple (?) case where the instructions are relatively independent - without any dependency chains other than trivial ones like the loop counter.

Here's a 4 uop loop (only 3 executed uops) with mild pressure. All instructions are independent (they don't share any sources or destinations). The add could in principle steal the p1 needed by the imul or the p6 needed by the dec/jnz:

Example 1

instr   p0 p1 p5 p6 
xor       (elim)
imul        X
add      X  X  X  X
dec               X

top:
    xor  r9, r9
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

The result is that this executes with perfect scheduling at 1.00 cycles/iteration:

   560,709,974      uops_dispatched_port_port_0                                     ( +-  0.38% )
 1,000,026,608      uops_dispatched_port_port_1                                     ( +-  0.00% )
   439,324,609      uops_dispatched_port_port_5                                     ( +-  0.49% )
 1,000,041,224      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,110      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,281,902      cycles:u                                                        ( +-  0.00% )

As expected, p1 and p6 are fully utilized by the imul and dec/jnz respectively, and then the add issues roughly half-and-half between the remaining available ports. Note roughly - the actual ratio is 56% and 44%, and this ratio is pretty stable across runs (note the +- 0.49% variation). If I adjust the loop alignment, the split changes (53/46 for 32B alignment, more like 57/42 for 32B+4 alignment). Now, if we change nothing except the position of imul in the loop:

Example 2

top:
    imul rax, rbx, 5
    xor  r9, r9
    add  r8, rdx
    dec esi
    jnz top

Then suddenly the p0/p5 split is exactly 50%/50%, with 0.00% variation:

   500,025,758      uops_dispatched_port_port_0                                     ( +-  0.00% )
 1,000,044,901      uops_dispatched_port_port_1                                     ( +-  0.00% )
   500,038,070      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,066,733      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,439      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,439,396      cycles:u                                                        ( +-  0.01% )

So that's already interesting, but it's hard to tell what's going on. Perhaps the exact behavior depends on the initial conditions at loop entry and is sensitive to ordering within the loop (e.g., because counters are used). This example shows that something more than "random" or "stupid" scheduling is going on. In particular, if you just eliminate the imul instruction from the loop, you get the following:

Example 3

   330,214,329      uops_dispatched_port_port_0                                     ( +-  0.40% )
   314,012,342      uops_dispatched_port_port_1                                     ( +-  1.77% )
   355,817,739      uops_dispatched_port_port_5                                     ( +-  1.21% )
 1,000,034,653      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,000,160      instructions:u            #    4.00  insns per cycle          ( +-  0.00% )
 1,000,235,522      cycles:u                                                      ( +-  0.00% )

Here, the add is now roughly evenly distributed among p0, p1 and p5 - so the presence of the imul did affect the add scheduling: it wasn't just a consequence of some "avoid port 1" rule.

Note here that total port pressure is only 3 uops/cycle, since the xor is a zeroing idiom and is eliminated in the renamer. Let's try with the max pressure of 4 uops. I expect whatever mechanism kicked in above to be able to perfectly schedule this also. We only change xor r9, r9 to xor r9, r10, so it is no longer a zeroing idiom. We get the following results:

Example 4

top:
    xor  r9, r10
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

       488,245,238      uops_dispatched_port_port_0                                     ( +-  0.50% )
     1,241,118,197      uops_dispatched_port_port_1                                     ( +-  0.03% )
     1,027,345,180      uops_dispatched_port_port_5                                     ( +-  0.28% )
     1,243,743,312      uops_dispatched_port_port_6                                     ( +-  0.04% )
     5,000,000,711      instructions:u            #    2.66  insns per cycle            ( +-  0.00% )
     1,880,606,080      cycles:u                                                        ( +-  0.08% )

Oops! Rather than evenly scheduling everything across p0156, the scheduler has underused p0 (it's only executing something on ~49% of cycles), and hence p1 and p6 are oversubscribed because they are executing both their required imul and dec/jnz ops. This behavior, I think, is consistent with a counter-based pressure indicator as hayesti indicated in their answer, and with uops being assigned to a port at issue time, not at execution time, as both hayesti and Peter Cordes mentioned. That behavior 3 makes the execute the oldest ready uops rule not nearly as effective. If uops were bound to execution ports not at issue but at execution, then this "oldest" rule would fix the problem above after one iteration - once one imul and one dec/jnz got held back for a single iteration, they would always be older than the competing xor and add instructions, so they would always get scheduled first. One thing I am learning, though, is that if ports are assigned at issue time, this rule doesn't help, because the ports are pre-determined at issue time. I guess it still helps a bit in favoring instructions which are part of long dependency chains (since these will tend to fall behind), but it's not the cure-all I thought it was.

That also seems to explain the results above: p0 gets assigned more pressure than it really has because the dec/jnz combo can in theory execute on p06. In fact, because the branch is predicted taken, it only ever goes to p6, but perhaps that info can't feed into the pressure-balancing algorithm, so the counters tend to see equal pressure on p016, meaning that the add and the xor get spread around differently than optimal.

Probably we can test this by unrolling the loop a bit, so the jnz is less of a factor...


1 OK, it is properly written μops, but that kills search-ability, and to actually type the "μ" character I usually resort to copy-pasting it from a webpage.

2 I had originally used imul instead of popcnt in the loop, but, unbelievably, IACA doesn't support it!

3 Please note that I'm not suggesting this is a poor design or anything - there are probably very good hardware reasons why the scheduler cannot easily make all its decisions at execution time.

Your questions are tough for a couple of reasons:

  1. The answer depends a lot on the microarchitecture of the processor, which can vary significantly from generation to generation.
  2. These are fine-grained details which Intel doesn't generally release to the public.

Nevertheless, I'll try to answer...

When multiple uops are ready in the reservation station, in what order are they scheduled to ports?

It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, 2 & 3) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as a place to start scanning for ready uops to dispatch. It was only pseudo-FIFO, so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in the Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues, so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station, and I would hazard an educated guess that they used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.

When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?

That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end for the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station, respectively.

Naturally this is just a heuristic and does not guarantee a perfect conflict-free schedule; however, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.
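A minimal sketch of that heuristic, assuming only what the patent description above states (port counters incremented at issue and decremented at execution, with a new uop bound to the least-loaded legal port) - the real hardware surely differs in detail:

```python
from collections import defaultdict

class CounterScheduler:
    """Toy model of the patent's counter heuristic: each port keeps a
    count of uops assigned to it but not yet executed; a uop is bound
    at issue time to its legal port with the smallest count."""
    def __init__(self):
        self.pending = defaultdict(int)

    def issue(self, legal_ports):
        # Pick the legal port with the fewest pending uops; ties go to
        # the first port listed (the real tie-break rule is unknown).
        port = min(legal_ports, key=lambda p: self.pending[p])
        self.pending[port] += 1
        return port

    def execute(self, port):
        # The uop leaves the reservation station.
        self.pending[port] -= 1

sched = CounterScheduler()
sched.issue((1,))                 # imul: p1-only, bumps p1's counter
print(sched.issue((0, 1, 5, 6)))  # add is steered away from p1 -> 0
```

This reproduces the effect described above: a port-restricted instruction raises its port's counter, which pushes subsequent flexible uops toward other ports.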

In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although, that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)

If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Since uops are inserted into the reservation station in order, oldest here does indeed refer to the time it entered the reservation station, i.e., oldest in program order.

By the way, I would take those IACA results with a grain of salt, as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you how many cycles in your thread uops were issued to ports 0-7. Maybe you could leverage these to get a better understanding of your program?

Here's what I found on Skylake, coming at it from the angle that uops are assigned to ports at issue time (i.e., when they are issued to the RS), not at dispatch time (i.e., at the moment they are sent to execute). Before, I had understood that the port decision was made at dispatch time.

I did a variety of tests which tried to isolate sequences of add operations that can go to p0156, and imul operations which go only to port 1. A typical test goes something like this:

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... many more mov instructions

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... many more mov instructions

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

Basically there is a long lead-in of mov eax, [edi] instructions, which only issue on p23 and hence don't clog up the ports used by the other instructions (I could have also used nop instructions, but the test would be a bit different, since nop doesn't issue to the RS). This is followed by the "payload" section, here composed of 4 imul and 12 add instructions, and then a lead-out section of more dummy mov instructions.

First, let's take a look at the patent that hayesti linked above, whose basic idea he describes: counters for each port track the total number of uops assigned to the port, and these are used to load-balance the port assignments. Take a look at this table included in the patent description:

(figure omitted: the patent's table of port-selection rules)

This table is used to pick between p0 and p1 for the 3 uops in an issue group, for the 3-wide architecture discussed in the patent. Note that the behavior depends on the position of the uop in the group, and that there are 4 rules 1 based on the count, which spread the uops around in a logical way. In particular, the count needs to be at +/- 2 or greater before the whole group gets assigned the under-used port.

Let's see if we can observe the "position in the issue group matters" behavior on Skylake. We use a payload of a single add like:

add edx, 1     ; position 0
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... and we slide it around inside the 4-instruction chunk like:

mov eax, [edi]
add edx, 1      ; position 1
mov eax, [edi]
mov eax, [edi]

... and so on, testing all four positions within the issue group 2 . This shows the following, when the RS is full (of mov instructions) but with no port pressure on any of the relevant ports:

  • The first add instruction goes to p5 or p6, with the port selected usually alternating with position (i.e., add instructions in even positions go to p5 and in odd positions go to p6).
  • The second add instruction also goes to p56 - whichever of the two the first one didn't go to.
  • After that, further add instructions start to be balanced around p0156, with p5 and p6 usually ahead but with things fairly even overall (i.e., the gap between p56 and the other two ports doesn't grow).

Next, I took a look at what happens if we load up p1 with imul operations, followed by a bunch of add operations:

imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

The results show that the scheduler handles this well - all of the imul uops got scheduled to p1 (as expected), and then none of the subsequent add instructions went to p1; they were spread around p056 instead. So here the scheduling is working well.

Of course, when the situation is reversed, and the series of imul comes after the adds, p1 is loaded up with its share of adds before the imuls hit. That's a result of port assignment happening in order at issue time, since there is no mechanism to "look ahead" and see the imul when scheduling the adds.

Overall, the scheduler looks to do a good job in these test cases.

It doesn't explain what happens in smaller, tighter loops like the following:

sub r9, 1
sub r10, 1
imul ebx, edx, 1
dec ecx
jnz top

Just like Example 4 in my question, this loop only fills p0 on ~30% of cycles, despite there being two sub instructions that should be able to go to p0 on every cycle. p1 and p6 are oversubscribed, each executing 1.24 uops per iteration (1 is ideal). I wasn't able to triangulate the difference between the examples that work well at the top of this answer and the bad loops - but there are still many ideas to try.

I did note that examples without instruction latency differences don't seem to suffer from this issue. For example, here's another 4-uop loop with "complex" port pressure:

top:
    sub r8, 1
    ror r11, 2
    bswap eax
    dec ecx
    jnz top

The uop map is as follows:

instr   p0 p1 p5 p6 
sub      X  X  X  X
ror      X        X
bswap       X  X   
dec/jnz           X

So the sub must always go to p15, shared with bswap, if things are to work out. They do:

Performance counter stats for './sched-test2' (2 runs):

   999,709,142      uops_dispatched_port_port_0                                     ( +-  0.00% )
   999,675,324      uops_dispatched_port_port_1                                     ( +-  0.00% )
   999,772,564      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,991,020      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,238,468      uops_issued_any                                               ( +-  0.00% )
 5,000,000,117      instructions:u            #    4.99  insns per cycle          ( +-  0.00% )
 1,001,268,722      cycles:u                                                      ( +-  0.00% )

So it seems that the issue may be related to instruction latencies (certainly, there are other differences between the examples). That's something that came up in this similar question.


1 The table has 5 rules, but the rules for counts of 0 and -1 are identical.

2 Of course, I can't be sure where the issue groups start and end, but regardless, we test four different positions as we slide down four instructions (but the labels could be wrong). I'm also not sure the issue group max size is 4 - earlier parts of the pipeline are wider - but I believe it is, and some testing seemed to show it was (loops with a multiple of 4 uops showed consistent scheduling behavior). In any case, the conclusions hold for different scheduling group sizes.

Section 2.12 of Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures [^1] explains how ports are assigned, though it fails to explain Example 4 in the question description. I also failed to figure out what role latency plays in the port assignment.

Previous work [19, 25, 26] has identified the ports that the µops of individual instructions can use. For µops that can use more than one port, it was, however, previously unknown how the actual port is chosen by the processor. We reverse-engineered the port assignment algorithm using microbenchmarks. In the following, we describe our findings for CPUs with eight ports; such CPUs are currently most widely used.

The ports are assigned when the µops are issued by the renamer to the scheduler. In a single cycle, up to four µops can be issued. In the following, we will call the position of a µop within a cycle an issue slot; e.g., the oldest instruction issued in a cycle would occupy issue slot 0.

The port that a µop is assigned depends on its issue slot and on the ports assigned to µops that have not been executed and were issued in a previous cycle.

In the following, we will only consider µops that can use more than one port. For a given µop m, let $P_{min}$ be the port to which the fewest non-executed µops have been assigned from among the ports that m can use. Let $P_{min'}$ be the port with the second smallest usage so far. If there is a tie among the ports with the smallest (or second smallest, respectively) usage, let $P_{min}$ (or $P_{min'}$) be the port with the highest port number from among these ports (the reason for this choice is probably that ports with higher numbers are connected to fewer functional units). If the difference between $P_{min}$ and $P_{min'}$ is greater than or equal to 3, we set $P_{min'}$ to $P_{min}$.

The µops in issue slots 0 and 2 are assigned to port $P_{min}$. The µops in issue slots 1 and 3 are assigned to port $P_{min'}$.

A special case is µops that can use port 2 and port 3. These ports are used by µops that handle memory accesses, and both ports are connected to the same types of functional units. For such µops, the port assignment algorithm alternates between port 2 and port 3.
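To make the quoted rules concrete, here's one possible reading of the algorithm as a Python sketch. This is my interpretation of the paper's description, not verified against hardware; the helper name assign_ports and the data shapes are made up for illustration:

```python
def assign_ports(slot_uops, pending):
    """One issue cycle of the reverse-engineered heuristic.

    slot_uops: up to 4 tuples of legal ports, in issue-slot order.
    pending:   port -> count of issued-but-not-executed uops, as of
               the start of the cycle (per the paper, the counters
               only reflect uops issued in previous cycles)."""
    assignments = []
    for slot, legal in enumerate(slot_uops):
        if len(legal) == 1:
            assignments.append(legal[0])
            continue
        # Rank candidates by (usage, -port): fewest pending uops first,
        # ties broken in favor of the highest port number.
        ranked = sorted(legal, key=lambda p: (pending.get(p, 0), -p))
        p_min, p_min2 = ranked[0], ranked[1]
        # If the second-best port is at least 3 uops behind, use P_min
        # for all slots.
        if pending.get(p_min2, 0) - pending.get(p_min, 0) >= 3:
            p_min2 = p_min
        assignments.append(p_min if slot in (0, 2) else p_min2)
    return assignments

# Four flexible ALU uops (p0156) issued while p1 is already loaded:
print(assign_ports([(0, 1, 5, 6)] * 4, {1: 5}))  # -> [6, 5, 6, 5]
```

Note how the "highest port number wins ties" rule sends the first uops to p6 and p5 rather than p0, which matches the p56-first behavior BeeOnRope observed in his position-in-issue-group tests.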

I tried to find out whether $P_{min}$ and $P_{min'}$ are shared between threads (Hyper-Threading), namely whether one thread can affect the port assignment of another one on the same core.

Just split the code used in BeeOnRope's answer into two threads:

thread1:
.loop:
    imul rax, rbx, 5
    jmp .loop

thread2:
    mov esi,1000000000
    .top:
    bswap eax
    dec  esi
    jnz  .top
    jmp thread2

The instruction bswap can be executed on ports 1 and 5, and imul r64, r64, i only on port 1. If the counters were shared between threads, you would see bswap executed on port 5 and imul executed on port 1.

The experiment was recorded as follows. (Ports P0 and P5 on thread 1 and P0 on thread 2 recorded a small amount of non-user data, but this doesn't affect the conclusion.) It can be seen from the data that the bswap instruction of thread 2 executes alternately on ports P1 and P5, rather than giving up P1.

port    thread 1 active cycles    thread 2 active cycles
P0              63,088,967                68,022,708
P1         180,219,013,832            95,742,764,738
P5              63,994,200            96,291,124,547
P6         180,330,835,515           192,048,880,421
total      180,998,504,099           192,774,759,297

Therefore, the counters are not shared between threads.

This conclusion does not conflict with SMotherSpectre [^2], which uses time as the side channel. (For example, thread 2 would wait longer on port 1 before it could use port 1.)

Executing instructions that occupy a specific port and measuring their timing enables inference about other instructions executing on the same port. We first choose two instructions, each scheduled on a single, distinct execution port. One thread runs and times a long sequence of single-µop instructions scheduled on port a, while simultaneously the other thread runs a long sequence of instructions scheduled on port b. We expect that, if a = b, contention occurs and the measured execution time is longer compared to the a ≠ b case.


[^1]: Abel, Andreas, and Jan Reineke. "Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures." arXiv preprint arXiv:2107.14210 (2021).

[^2]: Bhattacharyya, Atri, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. "SMoTherSpectre: Exploiting Speculative Execution through Port Contention." Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, November 6, 2019, 785-800. https://doi.org/10.1145/3319535.3363194
