
Is duplication of state resources considered optimal for hyper-threading?

This question has an answer that says:

Hyper-threading duplicates internal resources to reduce context switch time. Resources can be: Registers, arithmetic unit, cache.

Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?

Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even faster throughput?

Is duplication that researchers arrived at in some sense optimal, or is it just a reflection of current possibilities (transistor size, etc.)?

The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.

Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work instead of sitting idle waiting for cache misses, latency, and branch mispredictions. (See Modern Microprocessors A 90-Minute Guide! for very useful background, including a section on SMT. Also see this answer for more about how modern superscalar / out-of-order CPUs find and exploit instruction-level parallelism to run more than 1 instruction per clock.)

Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and it's mostly in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows how Sandybridge always partitioned the IDQ (decoded-uop queue that feeds the issue/rename stage), but IvyBridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads. For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers; that's why register renaming onto a big physical register file is done in the first place.)

Some things are statically partitioned, like the ROB, to stop one thread from filling up the back-end with work dependent on a cache-miss load.


Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.

Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads at faster than half speed each. (Usually significantly faster than half.)

But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. It helps for a few single-thread workloads, but helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.

Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly as many execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.


HT doesn't help at all for code that achieves very high throughput with a single thread per physical core. For example, saturating the front-end bandwidth of 4 uops / clock cycle without ever stalling.

Or if your code only bottlenecks on a core's peak FMA throughput or something (keeping 10 FMAs in flight with 10 vector accumulators). It can even hurt for code that ends up slowing down a lot from extra cache misses caused by competing for space in the L1D and L2 caches with another thread. (And also the uop cache and L1I cache.)

Saturating the FMAs and doing something with the results typically takes some instructions other than vfma... so high-throughput FP code is often close to saturating the front-end as well.

Agner Fog's microarch pdf says the same thing about very carefully tuned code not benefiting from HT, or even being hurt by it.

Paul Clayton's comments on the question also make some good points about SMT designs in general.


If you have different threads doing different things, SMT can still be helpful. e.g. high-throughput FP code sharing a core with a thread that does mostly integer work and stalls a lot on branch and cache misses could gain significant overall throughput. The low-throughput thread leaves most of the core unused most of the time, so running another thread that uses the other 80% of a core's front-end and back-end resources can be very good.
