
Why is vectorization, in general, faster than loops?

Why, at the lowest level of the hardware performing operations and the general underlying operations involved (i.e., things common to all programming languages' actual implementations when running code), is vectorization typically so dramatically faster than looping?

What does the computer do when looping that it doesn't do when using vectorization (I'm talking about the actual computations that the computer performs, not what the programmer writes), or what does it do differently?

I have been unable to convince myself why the difference should be so significant. I could probably be persuaded that vectorized code shaves off some looping overhead somewhere, but the computer still has to perform the same number of operations, doesn't it? For example, if we're multiplying a vector of size N by a scalar, we'll have N multiplications to perform either way, won't we?

Vectorization (as the term is normally used) refers to SIMD (single instruction, multiple data) operation.

That means, in essence, that one instruction carries out the same operation on a number of operands in parallel. For example, to multiply a vector of size N by a scalar, let's call M the number of operands of that size that the hardware can operate on simultaneously. The number of instructions it then needs to execute is approximately N/M, whereas with purely scalar operations it would have to carry out N operations.

For example, Intel's current AVX2 instruction set uses 256-bit registers. These can be used to hold (and operate on) a set of 4 operands of 64 bits apiece, or 8 operands of 32 bits apiece.

So, assuming you're dealing with 32-bit, single-precision real numbers, that means a single instruction can do 8 operations (multiplications, in your case) at once, so (at least in theory) you can finish N multiplications using only N/8 multiplication instructions. At least, in theory, this should allow the operation to finish about 8 times as fast as executing one instruction at a time would allow.
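
To make that concrete, here is a minimal sketch of the scalar-times-vector example using 256-bit intrinsics (compiled with something like -mavx2 or -march=native). The function name and the use of unaligned loads are illustrative choices, not anything from the question.

```c
#include <immintrin.h>  /* 256-bit SIMD intrinsics */
#include <stddef.h>

/* Multiply n single-precision floats by a scalar.
 * Each iteration of the first loop handles 8 elements with one 256-bit
 * multiply, so roughly N/8 vector multiplies replace N scalar multiplies. */
void scale_simd(float *x, float s, size_t n)
{
    __m256 vs = _mm256_set1_ps(s);          /* broadcast the scalar into all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);  /* load 8 floats */
        v = _mm256_mul_ps(v, vs);           /* 8 multiplications in one instruction */
        _mm256_storeu_ps(x + i, v);         /* store 8 results */
    }
    for (; i < n; ++i)                      /* scalar tail for leftover elements */
        x[i] *= s;
}
```

In practice you rarely write this by hand for something so simple; an optimizing compiler will usually auto-vectorize the plain scalar loop into essentially the same instructions.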

Of course, the exact benefit depends on how many operands you support per instruction. Intel's first attempts only supported 64-bit registers, so to operate on 8 items at once, those items could only be 8 bits apiece. They currently support 256-bit registers, and they've announced support for 512-bit (and they may have even shipped that in a few high-end processors, but not in normal consumer processors, at least yet). Making good use of this capability can also be non-trivial, to put it mildly. Scheduling instructions so you actually have N operands available and in the right places at the right times isn't necessarily an easy task (at all).

To put things in perspective, the (now ancient) Cray 1 gained a lot of its speed exactly this way. Its vector unit operated on sets of 64 registers of 64 bits apiece, so it could do 64 double-precision operations per clock cycle. On optimally vectorized code, it was much closer to the speed of a current CPU than you might expect based solely on its (much lower) clock speed. Taking full advantage of that wasn't always easy though (and still isn't).

Keep in mind, however, that vectorization is not the only way in which a CPU can carry out operations in parallel. There's also the possibility of instruction-level parallelism, which allows a single CPU (or the single core of a CPU) to execute more than one instruction at a time. Most modern CPUs include hardware to (theoretically) execute up to around 4 instructions per clock cycle 1 if the instructions are a mix of loads, stores, and ALU operations. They can fairly routinely execute close to 2 instructions per clock on average, or more in well-tuned loops when memory isn't a bottleneck.
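
As an illustration of instruction-level parallelism (the function and the choice of four accumulators are just for demonstration), a scalar sum written with independent accumulators lets an out-of-order core keep several additions in flight at once instead of serializing them on a single dependency chain:

```c
#include <stddef.h>

/* Scalar sum with four independent accumulators. The four additions in
 * each iteration do not depend on one another, so the CPU can overlap
 * them rather than waiting on one long chain of dependent adds.
 * (Note: regrouping floating-point additions can change rounding slightly.) */
double sum4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```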

Then, of course, there's multi-threading--running multiple streams of instructions on (at least logically) separate processors/cores.

So, a modern CPU might have, say, 4 cores, each of which can execute 2 vector multiplies per clock, and each of those instructions can operate on 8 operands. So, at least in theory, it can be carrying out 4 * 2 * 8 = 64 operations per clock.
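
In source code, reaching for all three levels at once might look something like the sketch below. It assumes an OpenMP-capable compiler (built with something like -O3 -fopenmp -march=native); the pragma splits the iterations across cores, the simd clause asks the compiler to vectorize each thread's chunk, and the out-of-order hardware overlaps the resulting independent instructions.

```c
#include <stddef.h>

/* Thread-level + SIMD + instruction-level parallelism from one loop:
 * OpenMP distributes iterations across cores, the `simd` clause requests
 * vectorization of each thread's chunk, and the hardware overlaps the
 * resulting independent instructions. */
void scale_parallel(float *x, float s, size_t n)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        x[i] *= s;
}
```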

Some instructions have better or worse throughput. For example, FP add throughput is lower than FMA or multiply on Intel before Skylake (1 vector per clock instead of 2). But boolean logic like AND or XOR has 3 vectors per clock throughput; it doesn't take many transistors to build an AND/XOR/OR execution unit, so CPUs replicate them. Bottlenecks on the total pipeline width (the front-end that decodes and issues into the out-of-order part of the core) are common when using high-throughput instructions, rather than bottlenecks on a specific execution unit.


  1. But over time, CPUs tend to have more resources available, so this number rises.

Vectorization has two main benefits.

  1. The primary benefit is that hardware designed to support vector instructions generally has hardware that is capable of performing multiple ALU operations in parallel when vector instructions are used. For example, if you ask it to perform 16 additions with a 16-element vector instruction, it may have 16 adders that can do all the additions at once, in parallel (see the sketch just after this list). The only way to access all those adders 1 is through vectorization. With scalar instructions you just get the 1 lonely adder.

  2. There is usually some overhead saved by using vector instructions. You load and store data in big chunks (up to 512 bits at a time on some recent Intel CPUs), and each loop iteration does more work, so the loop overhead is generally lower in a relative sense 2, and you need fewer instructions to do the same work, so the CPU front-end overhead is lower, etc.
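
As an illustration of the first point, on hardware with AVX-512F a single 512-bit instruction performs 16 additions of 32-bit integers at once. This sketch assumes such hardware and a compiler flag like -mavx512f; the function name is made up for the example.

```c
#include <immintrin.h>  /* AVX-512F intrinsics */

/* Sixteen 32-bit integer additions issued as one 512-bit instruction.
 * a, b and out must each point to at least 16 ints. */
void add16(const int *a, const int *b, int *out)
{
    __m512i va = _mm512_loadu_si512(a);     /* load 16 ints from a */
    __m512i vb = _mm512_loadu_si512(b);     /* load 16 ints from b */
    __m512i vc = _mm512_add_epi32(va, vb);  /* 16 additions at once */
    _mm512_storeu_si512(out, vc);           /* store 16 results */
}
```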

Finally, your dichotomy between loops and vectorization is odd. When you take non-vector code and vectorize it, you are generally going to end up with a loop if there was a loop there before, or not if there wasn't. The comparison is really between scalar (non-vector) instructions and vector instructions.


1 Or at least 15 of the 16; perhaps one is also used to do scalar operations.

2 You could probably get a similar loop-overhead benefit in the scalar case at the cost of a lot of loop unrolling.
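
For instance (a hypothetical sketch of what that unrolling looks like), the scalar multiply unrolled by four runs the counter increment and the end-of-loop test once per four multiplications instead of once per multiplication, while still performing all N scalar multiplies:

```c
#include <stddef.h>

/* Scalar multiply unrolled by four: the loop overhead (increment and
 * compare/branch) is paid once per four multiplications, but the number
 * of multiplications is unchanged. */
void scale_unrolled(float *x, float s, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     *= s;
        x[i + 1] *= s;
        x[i + 2] *= s;
        x[i + 3] *= s;
    }
    for (; i < n; ++i)   /* leftover elements */
        x[i] *= s;
}
```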

Vectorization is a type of parallel processing. It enables more computer hardware to be devoted to performing the computation, so the computation is done faster.

Many numerical problems, especially the solution of partial differential equations, require the same calculation to be performed for a large number of cells, elements or nodes. Vectorization performs the calculation for many cells/elements/nodes in parallel.
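
As a small, hypothetical example of that pattern, a 1-D explicit heat-equation update applies exactly the same arithmetic to every interior cell, so a vectorizing compiler can update many cells per instruction (the function name and coefficient r are illustrative):

```c
#include <stddef.h>

/* One explicit time step of a 1-D heat equation: every interior cell
 * gets the same stencil arithmetic, which is what makes the loop easy
 * to vectorize. r is the (dimensionless) diffusion coefficient. */
void heat_step(const double *u, double *u_new, size_t n, double r)
{
    for (size_t i = 1; i + 1 < n; ++i)
        u_new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
}
```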

Vectorization uses special hardware. Unlike a multicore CPU, in which each of the parallel processing units is a fully functional CPU core, a vector processing unit can perform only simple operations, and all the units perform the same operation at the same time, operating on a sequence of data values (a vector) simultaneously.
