

What is lost in going from AVX512 on Intel Xeon Phi to AVX2 on Intel i5-8259U?

Trying to follow a course on Coursera, I tried to optimize a sample C++ code for my Intel i5-8259U CPU, which I believe supports the AVX2 SIMD instruction set. Now, AVX2 supplies 16 registers per core (called YMM0, YMM1, ..., YMM15) which are 256 bits wide, meaning that each can process up to 4 double-precision floating-point numbers simultaneously. Taking advantage of AVX2 SIMD instructions should optimise my code to run up to 4 times faster compared to scalar instructions.
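For concreteness, here is a minimal sketch (my own toy example, not the course code) of one AVX2/FMA instruction operating on all 4 double lanes of a YMM register at once; it assumes a CPU and compiler flags with AVX2 and FMA enabled (e.g. -xCORE-AVX2 with ICC, or -mavx2 -mfma with gcc/clang):

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);  // one YMM register = 4 doubles
    __m256d b = _mm256_set1_pd(10.0);               // broadcast 10.0 to all 4 lanes
    __m256d c = _mm256_fmadd_pd(a, b, b);           // c = a*b + b in a single FMA
    double out[4];
    _mm256_storeu_pd(out, c);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 20 30 40 50
}
```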

In the linked course, you can try running the same code for numerical integration on an Intel Xeon Phi 7210 (Knights Landing) processor, which supports AVX512 and its 512-bit wide registers. That means we should expect double-precision operations to speed up by a factor of 8. Indeed, the code used by the instructor obtains speedups up to a factor of 14, which is 175% of 8. The additional speedup is due to OpenMP.

In order to run the same code on my CPU, the only thing I changed was the optimisation flag passed to the Intel compiler: instead of -xMIC-AVX512, I used -xCORE-AVX2. The speedup I obtained is only a factor of 2, a measly 50% of the speedup expected from SIMD vectorisation alone on 256-bit registers. Compare this 50% to the 175% obtained on the Intel Xeon Phi processor.

Why do I see this drastic loss in performance just by moving from AVX512 to AVX2? Surely, something other than SIMD optimisation is at play here. What am I missing?


PS: You can find the referenced code in the folder integral/solutions/1-simd/ here.

TL:DR: KNL (Knights Landing) is only good at running code specifically compiled for it, and thus gets a much bigger speedup because it stumbles badly when running "generic" code.

Coffee Lake only gets a speedup of 2 from 128-bit SSE2 to 256-bit AVX, because it runs both "generic" and targeted code optimally.

Mainstream CPUs like Coffee Lake are one of the targets that "generic" tuning in modern compilers cares about, and they don't have many weaknesses in general. But KNL isn't: ICC without any options doesn't care about KNL.


You're assuming that the baseline for your speedups is scalar. But without any options like -march=native or -xCORE-AVX2, Intel's compiler (ICC) will still auto-vectorize with SSE2, because that's the baseline for x86-64.

-xCORE-AVX2 doesn't enable auto-vectorization; it just gives the auto-vectorizer even more instructions to play with. The optimization level (including auto-vectorization) is controlled by -O0 / -O2 / -O3, and for FP by strict vs. fast -fp-model. Intel's compiler defaults to full optimization with -fp-model fast=1 (one level below fast=2), so it's something like gcc -O3 -ffast-math.
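As an illustration (a hypothetical loop, not the question's code), a reduction like the one below is auto-vectorized by plain icc -O3 using 128-bit SSE2, and gets 256-bit YMM instructions with -xCORE-AVX2; the default fast fp-model is what allows the compiler to reassociate the sum for vectorization in the first place:

```cpp
#include <cstddef>

// icc -O3 dot.cpp              -> SSE2 (baseline), 2 doubles per vector op
// icc -O3 -xCORE-AVX2 dot.cpp  -> AVX2+FMA, 4 doubles per vector op
double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];  // vectorizable only because FP reassociation is allowed
    return sum;
}
```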

But without extra options, it can only use the baseline instruction set, which for x86-64 is SSE2. That's still better than scalar.

SSE2 uses 128-bit XMM registers for packed double math, with the same instruction throughput as AVX (on your i5 Coffee Lake) but half the amount of work per instruction. (And it doesn't have FMA, so the compiler couldn't contract any mul+add operations in your source into FMA instructions the way it could with AVX+FMA.)
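A sketch of that difference in intrinsics (illustrative only; the compiler makes these choices automatically from scalar source):

```cpp
#include <immintrin.h>

// SSE2: separate multiply and add, 2 doubles per instruction
__m128d mul_add_sse2(__m128d a, __m128d b, __m128d c) {
    return _mm_add_pd(_mm_mul_pd(a, b), c);   // 2 instructions
}

// AVX2+FMA: one contracted instruction, 4 doubles at a time
__m256d mul_add_avx2(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmadd_pd(a, b, c);          // 1 instruction
}
```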

So a factor-of-2 speedup on your Coffee Lake CPU is exactly what you should expect for a simple problem that purely bottlenecks on vector mul/add/FMA SIMD throughput (not on memory/cache or anything else).

The speedup depends on what your code is doing. If you bottleneck on memory or cache bandwidth, wider registers only help a bit, by better utilizing memory parallelism to keep it saturated.

And AVX + AVX2 add more powerful shuffles and blends and other cool stuff, but for simple problems with pure vertical SIMD that doesn't help.


So the real question is: why does AVX512 help by more than 4x on KNL? 8 double elements per AVX512 SIMD instruction on Knights Landing, up from 2 with SSE2, would give an expected speedup of 4x if instruction throughput were the same, assuming the total instruction count were identical with AVX512. (Which isn't the case: for the same loop unroll, the amount of vector work per unit of loop overhead increases with wider vectors, plus other factors.)

It's hard to say for sure without knowing what source code you were compiling. AVX512 adds some features that may help save instructions, like broadcast memory-source operands instead of requiring a separate broadcast load into a register.
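For example (a made-up loop, not the question's code), scaling an array by a scalar that lives in memory:

```cpp
#include <cstddef>

void scale(double* x, const double* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= *s;   // *s must be splatted across all vector lanes
}
```

With SSE/AVX the compiler has to materialize the broadcast as its own instruction (e.g. a vbroadcastsd into a register, hopefully hoisted out of the loop); AVX512 can instead fold it into the arithmetic instruction as an embedded broadcast memory operand, something like vmulpd zmm0, zmm1, [rsi]{1to8}.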

If your problem involves any division, KNL has extremely slow full-precision FP division, and should usually use an AVX512ER approximation instruction (28-bit precision) plus a Newton-Raphson iteration (a couple of FMA + mul operations) to double that, giving close to full double precision (53-bit significand, including 1 implicit bit). -xMIC-AVX512 enables AVX512ER, and sets tuning options so ICC will actually choose to use it.
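A hedged sketch of that recipe (assuming AVX512F + AVX512ER are available, i.e. compiled with -xMIC-AVX512 or -march=knl; this shows the general pattern, not necessarily the exact sequence ICC emits):

```cpp
#include <immintrin.h>

// Fast reciprocal on KNL: ~28-bit approximation, then one Newton-Raphson
// step to roughly double the precision, instead of slow vdivpd.
__m512d fast_recip(__m512d d) {
    __m512d x0  = _mm512_rcp28_pd(d);            // AVX512ER: ~28-bit approx of 1/d
    __m512d one = _mm512_set1_pd(1.0);
    __m512d err = _mm512_fnmadd_pd(d, x0, one);  // err = 1 - d*x0
    return _mm512_fmadd_pd(x0, err, x0);         // x1 = x0 + x0*err ~= 1/d
}
```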

(By contrast, Coffee Lake's AVX 256-bit division throughput isn't any better than its 128-bit division throughput in doubles per cycle, but without AVX512ER there isn't an efficient way to use Newton-Raphson for double.) See Floating point division vs floating point multiplication; the Skylake numbers apply to your Coffee Lake.


AVX / AVX512 can avoid extra movaps instructions to copy registers, which helps a lot on KNL (every instruction that isn't a mul/add/FMA costs FP throughput, because it has 2-per-clock FMA but only 2-per-clock max total instruction throughput). (https://agner.org/optimize/)
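To illustrate the copy (a toy example of mine, not from the question's code):

```cpp
#include <immintrin.h>

// 'a' stays live past the multiply. With destructive 2-operand SSE2 the
// compiler must emit a copy first:
//     movaps xmm2, xmm0
//     mulpd  xmm2, xmm1         ; xmm2 = a*b, original a preserved in xmm0
// With 3-operand AVX, no copy is needed:
//     vmulpd xmm2, xmm0, xmm1   ; xmm2 = a*b, xmm0 untouched
__m128d mul_keep_a(__m128d a, __m128d b, __m128d* a_out) {
    *a_out = a;
    return _mm_mul_pd(a, b);
}
```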

KNL is based on the Silvermont low-power core (that's how they fit so many cores onto one chip).

By contrast, Coffee Lake has a much more capable front-end and more back-end execution throughput: it still has 2-per-clock FMA/mul/add, but 4-per-clock total instruction throughput, so there's room to run some non-FMA instructions without taking away from FMA throughput.


Other slowdowns from running SSE/SSE2 instructions on KNL (Xeon Phi)

KNL is built specifically to run AVX512 code. They didn't waste transistors making it efficient at running legacy code that wasn't compiled specifically for it (with -xMIC-AVX512 or -march=knl).

But your Coffee Lake is a mainstream desktop/laptop core that has to be fast running any past or future binaries, including code that only uses "legacy" SSE2 encodings of instructions, not AVX.

SSE2 instructions that write an XMM register leave the upper elements of the corresponding YMM/ZMM register unmodified. (An XMM reg is the low 128 bits of the full vector reg.) This would in theory create a false dependency when running legacy SSE2 instructions on a CPU that supports wider vectors. (Mainstream Intel CPUs like the Sandybridge family avoid this with mode transitions, or, on Skylake, with actual false dependencies if you don't use vzeroupper properly. See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for a comparison of the two strategies.)
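A sketch of the vzeroupper rule on those mainstream CPUs (compilers already insert this automatically; it's written out manually here only for illustration):

```cpp
#include <immintrin.h>

void avx_work_then_legacy_sse(double* x) {
    __m256d v = _mm256_loadu_pd(x);
    _mm256_storeu_pd(x, _mm256_add_pd(v, v));  // 256-bit work dirties upper YMM halves
    _mm256_zeroupper();  // clear them, so following legacy-SSE code pays no
                         // transition / false-dependency penalty
}
```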

KNL apparently does have a way to avoid false dependencies: according to Agner Fog's testing (in his microarch guide), he describes it as being like the partial-register renaming that the P6 family does when you write to integer registers like AL. You only get a partial-register stall when you read the full register. If that's accurate, then SSE2 code should run fine on KNL, because there's no AVX code reading the YMM or ZMM registers.

(But if there were false dependencies, a movaps xmm0, [rdi] in a loop might have to wait until the last instruction that wrote xmm0 in the previous iteration had finished. That would defeat KNL's modest out-of-order execution ability to overlap independent work across loop iterations and hide load + FP latency.)


There's also the possibility of decode stalls on KNL when running legacy SSE/SSE2 instructions: it stalls on instructions with more than 3 prefixes, including 0F escape bytes. So, for example, any SSSE3 or SSE4.x instruction with a REX prefix to access r8..r15 or xmm8..xmm15 will cause a decode stall of 5 to 6 cycles.

But you won't have that if you omitted all -x / -march options, because SSE1/SSE2 + REX is still fine: just (an optional REX) + 2 other prefixes, for instructions like 66 0F 58 addpd.

See Agner Fog's microarch guide, in the KNL chapter: 16.2 instruction fetch and decoding.


OpenMP - if you're looking at OpenMP to use multiple threads, obviously KNL has many more cores.

But even within one physical core, KNL has 4-way hyperthreading as another way (besides out-of-order exec) to hide the high-ish latency of its SIMD instructions. For example, FMA/add/sub latency is 6 cycles on KNL vs. 4 on Skylake/Coffee Lake.

So breaking a problem up into multiple threads can sometimes significantly increase utilization of each individual core on KNL. But on a mainstream big-core CPU like Coffee Lake, its massive out-of-order execution capabilities can already find and exploit all the instruction-level parallelism in many loops, even if the loop body does a chain of things with each independent input.
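For reference, a generic sketch of the kind of threaded, vectorized integration loop the course is about (my own midpoint-rule example with a placeholder integrand, not the course's actual code; compile with icc -qopenmp plus the -x option for your target):

```cpp
#include <cstddef>

double integrate(double a, double b, std::size_t n) {
    const double h = (b - a) / n;
    double sum = 0.0;
    // OpenMP splits iterations across cores (and hardware threads on KNL),
    // while each thread's chunk of the loop is SIMD-vectorized.
    #pragma omp parallel for simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        const double x = a + (i + 0.5) * h;  // midpoint of subinterval i
        sum += x * x;                        // placeholder integrand f(x) = x^2
    }
    return sum * h;
}
```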
