Why is this SIMD multiplication not faster than non-SIMD multiplication?

Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this:

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

The compiler produces the following assembly with -O2:

mul_c(double*, double*):
        xor     eax, eax
.L2:
        movsd   xmm0, QWORD PTR [rdi+rax]
        mulsd   xmm0, QWORD PTR [rsi+rax]
        movsd   QWORD PTR [rdi+rax], xmm0
        add     rax, 8
        cmp     rax, 8000000
        jne     .L2
        rep ret

From the above assembly it seems that the compiler uses SSE instructions, but only the scalar variants (mulsd), so it multiplies just one double per iteration. So I decided to write the same function in inline assembly instead, where I make full use of the xmm0 register and multiply two doubles in one go:

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix             \n\t"
        "xor    rax, rax                    \n\t"
        "0:                                 \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0 \n\t"
        "add    rax, 16                     \n\t"
        "cmp    rax, 8000000                \n\t"
        "jne    0b                          \n\t"
        ".att_syntax noprefix               \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

After measuring the execution time individually for both of these functions, it seems that both of them take 1 ms to complete:

> gcc -O2 main.cpp
> ./a.out < input

mul_c: 1 ms
mul_asm: 1 ms

[a lot of doubles...]

I expected the SIMD implementation to be at least twice as fast (0 ms), as there is only half the number of multiplication/memory instructions.

So my question is: why isn't the SIMD implementation faster than the ordinary C/C++ implementation, when the SIMD implementation only performs half the number of multiplication/memory instructions?

Here's the full program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix             \n\t"
        "xor    rax, rax                    \n\t"
        "0:                                 \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0 \n\t"
        "add    rax, 16                     \n\t"
        "cmp    rax, 8000000                \n\t"
        "jne    0b                          \n\t"
        ".att_syntax noprefix               \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

int main()
{
    struct timeval t1;
    struct timeval t2;
    unsigned long long time;

    double* a = (double*)malloc(sizeof(double) * 1000000);
    double* b = (double*)malloc(sizeof(double) * 1000000);
    double* c = (double*)malloc(sizeof(double) * 1000000);

    for (int i = 0; i != 1000000; ++i)
    {
        double v;
        scanf("%lf", &v);
        a[i] = v;
        b[i] = v;
        c[i] = v;
    }

    gettimeofday(&t1, NULL);
    mul_c(a, b);
    gettimeofday(&t2, NULL);
    time = 1000 * (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1000;
    printf("mul_c: %llu ms\n", time);

    gettimeofday(&t1, NULL);
    mul_asm(b, c);
    gettimeofday(&t2, NULL);
    time = 1000 * (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1000;
    printf("mul_asm: %llu ms\n\n", time);

    for (int i = 0; i != 1000000; ++i)
    {
        printf("%lf\t\t\t%lf\n", a[i], b[i]);
    }

    return 0;
}

I also tried to make use of all the xmm registers (0-7) and remove instruction dependencies to get better parallel computation:

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix                 \n\t"
        "xor    rax, rax                        \n\t"
        "0:                                     \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax]     \n\t"
        "movupd xmm1, xmmword ptr [rdi+rax+16]  \n\t"
        "movupd xmm2, xmmword ptr [rdi+rax+32]  \n\t"
        "movupd xmm3, xmmword ptr [rdi+rax+48]  \n\t"
        "movupd xmm4, xmmword ptr [rdi+rax+64]  \n\t"
        "movupd xmm5, xmmword ptr [rdi+rax+80]  \n\t"
        "movupd xmm6, xmmword ptr [rdi+rax+96]  \n\t"
        "movupd xmm7, xmmword ptr [rdi+rax+112] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax]     \n\t"
        "mulpd  xmm1, xmmword ptr [rsi+rax+16]  \n\t"
        "mulpd  xmm2, xmmword ptr [rsi+rax+32]  \n\t"
        "mulpd  xmm3, xmmword ptr [rsi+rax+48]  \n\t"
        "mulpd  xmm4, xmmword ptr [rsi+rax+64]  \n\t"
        "mulpd  xmm5, xmmword ptr [rsi+rax+80]  \n\t"
        "mulpd  xmm6, xmmword ptr [rsi+rax+96]  \n\t"
        "mulpd  xmm7, xmmword ptr [rsi+rax+112] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0     \n\t"
        "movupd xmmword ptr [rdi+rax+16], xmm1  \n\t"
        "movupd xmmword ptr [rdi+rax+32], xmm2  \n\t"
        "movupd xmmword ptr [rdi+rax+48], xmm3  \n\t"
        "movupd xmmword ptr [rdi+rax+64], xmm4  \n\t"
        "movupd xmmword ptr [rdi+rax+80], xmm5  \n\t"
        "movupd xmmword ptr [rdi+rax+96], xmm6  \n\t"
        "movupd xmmword ptr [rdi+rax+112], xmm7 \n\t"
        "add    rax, 128                        \n\t"
        "cmp    rax, 8000000                    \n\t"
        "jne    0b                              \n\t"
        ".att_syntax noprefix                   \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

But it still runs at 1 ms, the same speed as the ordinary C/C++ implementation.


UPDATES

As suggested by answers/comments, I've implemented another way of measuring the execution time:

#include <stdio.h>
#include <stdlib.h>

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix             \n\t"
        "xor    rax, rax                    \n\t"
        "0:                                 \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0 \n\t"
        "add    rax, 16                     \n\t"
        "cmp    rax, 8000000                \n\t"
        "jne    0b                          \n\t"
        ".att_syntax noprefix               \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

void mul_asm2(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix                 \n\t"
        "xor    rax, rax                        \n\t"
        "0:                                     \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax]     \n\t"
        "movupd xmm1, xmmword ptr [rdi+rax+16]  \n\t"
        "movupd xmm2, xmmword ptr [rdi+rax+32]  \n\t"
        "movupd xmm3, xmmword ptr [rdi+rax+48]  \n\t"
        "movupd xmm4, xmmword ptr [rdi+rax+64]  \n\t"
        "movupd xmm5, xmmword ptr [rdi+rax+80]  \n\t"
        "movupd xmm6, xmmword ptr [rdi+rax+96]  \n\t"
        "movupd xmm7, xmmword ptr [rdi+rax+112] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax]     \n\t"
        "mulpd  xmm1, xmmword ptr [rsi+rax+16]  \n\t"
        "mulpd  xmm2, xmmword ptr [rsi+rax+32]  \n\t"
        "mulpd  xmm3, xmmword ptr [rsi+rax+48]  \n\t"
        "mulpd  xmm4, xmmword ptr [rsi+rax+64]  \n\t"
        "mulpd  xmm5, xmmword ptr [rsi+rax+80]  \n\t"
        "mulpd  xmm6, xmmword ptr [rsi+rax+96]  \n\t"
        "mulpd  xmm7, xmmword ptr [rsi+rax+112] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0     \n\t"
        "movupd xmmword ptr [rdi+rax+16], xmm1  \n\t"
        "movupd xmmword ptr [rdi+rax+32], xmm2  \n\t"
        "movupd xmmword ptr [rdi+rax+48], xmm3  \n\t"
        "movupd xmmword ptr [rdi+rax+64], xmm4  \n\t"
        "movupd xmmword ptr [rdi+rax+80], xmm5  \n\t"
        "movupd xmmword ptr [rdi+rax+96], xmm6  \n\t"
        "movupd xmmword ptr [rdi+rax+112], xmm7 \n\t"
        "add    rax, 128                        \n\t"
        "cmp    rax, 8000000                    \n\t"
        "jne    0b                              \n\t"
        ".att_syntax noprefix                   \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

unsigned long timestamp()
{
    unsigned long a;

    asm volatile
    (
        ".intel_syntax noprefix \n\t"
        "xor   rax, rax         \n\t"
        "xor   rdx, rdx         \n\t"
        "RDTSCP                 \n\t"
        "shl   rdx, 32          \n\t"
        "or    rax, rdx         \n\t"
        ".att_syntax noprefix   \n\t"

        : "=a" (a)
        : 
        : "memory", "cc"
    );

    return a;
}

int main()
{
    unsigned long t1;
    unsigned long t2;

    double* a;
    double* b;

    a = (double*)malloc(sizeof(double) * 1000000);
    b = (double*)malloc(sizeof(double) * 1000000);

    for (int i = 0; i != 1000000; ++i)
    {
        double v;
        scanf("%lf", &v);
        a[i] = v;
        b[i] = v;
    }

    t1 = timestamp();
    mul_c(a, b);
    //mul_asm(a, b);
    //mul_asm2(a, b);
    t2 = timestamp();
    printf("mul_c: %lu cycles\n\n", t2 - t1);

    for (int i = 0; i != 1000000; ++i)
    {
        printf("%lf\t\t\t%lf\n", a[i], b[i]);
    }

    return 0;
}

When I run the program with this measurement, I get this result:

mul_c:    ~2163971628 cycles
mul_asm:  ~2532045184 cycles
mul_asm2: ~5230488    cycles <-- what???

Two things are worth noticing here. First of all, the cycle counts vary a LOT, and I assume that's because the operating system allows other processes to run in between. Is there any way to prevent that, or to only count the cycles while my program is executing? Also, mul_asm2 produces identical output compared to the other two, but it is so much faster. How?


I tried Z boson's program on my system together with my two implementations and got the following result:

> g++ -O2 -fopenmp main.cpp
> ./a.out
mul         time 1.33, 18.08 GB/s
mul_SSE     time 1.13, 21.24 GB/s
mul_SSE_NT  time 1.51, 15.88 GB/s
mul_SSE_OMP time 0.79, 30.28 GB/s
mul_SSE_v2  time 1.12, 21.49 GB/s
mul_v2      time 1.26, 18.99 GB/s
mul_asm     time 1.12, 21.50 GB/s
mul_asm2    time 1.09, 22.08 GB/s

Your asm code really is OK. What is not OK is the way you measure it. As I pointed out in the comments, you should:

a) use far more iterations - 1 million is nothing for a modern CPU

b) use a high-precision timer (HPT) for the measurement

c) use RDTSC or RDTSCP to count real CPU clocks (a sketch using the RDTSCP intrinsic follows below)
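
For reference, the same TSC read can be done without inline assembly using the compiler intrinsic (a minimal sketch, assuming GCC/Clang on x86; timestamp_intrin is an illustrative name):

#include <x86intrin.h>

// Read the time-stamp counter via RDTSCP. The intrinsic also reports
// which core the counter was read on through its out-parameter, which
// is ignored here.
static unsigned long long timestamp_intrin(void)
{
    unsigned int aux;
    return __rdtscp(&aux);
}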

Additionally, why are you afraid of -O3? Don't forget to build the code for your platform, so use -march=native. If your CPU supports AVX or AVX2, the compiler will take the opportunity to produce even better code.

Next thing - give the compiler some hints about aliasing and alignment, if you know your code allows it.

Here is my version of your mul_c - yes, it is GCC-specific, but you showed that you used GCC:

void mul_c(double* restrict a, double* restrict b)
{
   a = __builtin_assume_aligned (a, 16);
   b = __builtin_assume_aligned (b, 16);

    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

It will produce:

mul_c(double*, double*):
        xor     eax, eax
.L2:
        movapd  xmm0, XMMWORD PTR [rdi+rax]
        mulpd   xmm0, XMMWORD PTR [rsi+rax]
        movaps  XMMWORD PTR [rdi+rax], xmm0
        add     rax, 16
        cmp     rax, 8000000
        jne     .L2
        rep ret

If you have AVX2 and make sure the data is 32-byte aligned, it will become:

mul_c(double*, double*):
        xor     eax, eax
.L2:
        vmovapd ymm0, YMMWORD PTR [rdi+rax]
        vmulpd  ymm0, ymm0, YMMWORD PTR [rsi+rax]
        vmovapd YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, 8000000
        jne     .L2
        vzeroupper
        ret

So there is no need for handcrafted asm if the compiler can do it for you ;)
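
A minimal sketch of the 32-byte-aligned variant, assuming the buffers really are 32-byte aligned (e.g. allocated with aligned_alloc(32, ...)); mul_c_avx is a hypothetical name:

void mul_c_avx(double* __restrict a, double* __restrict b)
{
    // Promise GCC 32-byte alignment so it can emit aligned AVX loads/stores.
    a = (double*)__builtin_assume_aligned(a, 32);
    b = (double*)__builtin_assume_aligned(b, 32);

    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

Built with e.g. -O2 -march=native on an AVX2 machine, this should yield the ymm loop shown above.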

There was a major bug in the timing function I used for the previous benchmarks. This grossly underestimated the bandwidth without vectorization, as well as other measurements. Additionally, there was another problem: the bandwidth was overestimated due to COW on the array that was read but not written to. Finally, the maximum bandwidth I used was incorrect. I have updated my answer with the corrections, and I have left the old answer at the end of this answer.


Your operation is memory bandwidth bound. This means the CPU is spending most of its time waiting on slow memory reads and writes. An excellent explanation of this can be found here: Why vectorizing the loop does not have performance improvement.
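
A quick back-of-the-envelope check supports this (assuming a write-allocate cache, so the store to a costs a read as well): each call reads a, reads b, and writes a back, i.e. roughly 3 × 1,000,000 × 8 bytes = 24 MB of memory traffic. At a sustained bandwidth in the low tens of GB/s that takes on the order of 1 ms, which matches the measurement regardless of how the multiplies themselves are issued.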

However, I have to disagree slightly with one statement in that answer.

So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.

In fact, vectorization, unrolling, and multiple threads can significantly increase the bandwidth even in memory-bandwidth-bound operations. The reason is that it is difficult to obtain the maximum memory bandwidth. A good explanation of this can be found here: https://stackoverflow.com/a/25187492/2542702.

The rest of my answer will show how vectorization and multiple threads can get closer to the maximum memory bandwidth.

My test system: Ubuntu 16.10, Skylake (i7-6700HQ @ 2.60GHz), 32GB RAM, dual-channel DDR4-2400. The maximum bandwidth of my system is 38.4 GB/s.

From the code below I produce the following tables. I set the number of threads using OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4. The efficiency is bandwidth/max_bandwidth.

-O2 -march=native -fopenmp
Threads Efficiency
1       59.2%
2       76.6%
4       74.3%
8       70.7%

-O2 -march=native -fopenmp -funroll-loops
1       55.8%
2       76.5%
4       72.1%
8       72.2%

-O3 -march=native -fopenmp
1       63.9%
2       74.6%
4       63.9%
8       63.2%

-O3 -march=native -fopenmp -mprefer-avx128
1       67.8%
2       76.0%
4       63.9%
8       63.2%

-O3 -march=native -fopenmp -mprefer-avx128 -funroll-loops
1       68.8%
2       73.9%
4       69.0%
8       66.8%

After several runs, given the uncertainty in the measurements, I have formed the following conclusions:

  • single threaded scalar operations get more than 50% of the bandwidth.
  • two threaded scalar operations get the highest bandwidth.
  • single threaded vector operations are faster than single threaded scalar operations.
  • single threaded SSE operations are faster than single threaded AVX operations.
  • unrolling is not helpful.
  • unrolling single-threaded operations is slower than without unrolling.
  • more threads than cores (Hyper-Threading) gives a lower bandwidth.

The solution that gives the best bandwidth is scalar operations with two threads.
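
As a side note, the thread count can also be fixed in the code rather than through the environment. A minimal sketch against the mul kernel from the benchmark below (mul_2threads is a hypothetical name; the num_threads clause overrides OMP_NUM_THREADS):

#include <omp.h>

#define N 10000000

// Same kernel as mul() in the benchmark, pinned to the two threads
// that gave the best bandwidth in the tables above.
void mul_2threads(double *a, double *b) {
  #pragma omp parallel for num_threads(2)
  for (int i = 0; i < N; i++) a[i] *= b[i];
}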

The code I used to benchmark:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

#define N 10000000
#define R 100

void mul(double *a, double *b) {
  #pragma omp parallel for
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
  double maxbw = 2.4*2*8; // 2.4GHz * 2-channels * 64-bits * 1-byte/8-bits 
  double mem = 3*sizeof(double)*N*R*1E-9; // GB

  double *a = (double*)malloc(sizeof *a * N);
  double *b = (double*)malloc(sizeof *b * N);

  //due to copy-on-write b must be initialized to get the correct bandwidth
  //also, GCC will convert malloc + memset(0) to calloc so use memset(1)
  memset(b, 1, sizeof *b * N);

  double dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) mul(a,b);
  dtime += omp_get_wtime();
  printf("%.2f s, %.1f GB/s, %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

  free(a), free(b);
}

The old solution with the timing bug

The modern solution for inline assembly is to use intrinsics. There are still cases where one needs inline assembly, but this is not one of them.

One intrinsics solution for your inline assembly approach is simply:

void mul_SSE(double*  a, double*  b) {
  for (int i = 0; i<N/2; i++) 
      _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

Let me define some test code:

#include <x86intrin.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

#define N 1000000
#define R 1000

typedef __attribute__(( aligned(32)))  double aligned_double;
void  (*fp)(aligned_double *a, aligned_double *b);

void mul(aligned_double* __restrict a, aligned_double* __restrict b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

void mul_SSE(double*  a, double*  b) {
  for (int i = 0; i<N/2; i++) _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

void mul_SSE_NT(double*  a, double*  b) {
  for (int i = 0; i<N/2; i++) _mm_stream_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

void mul_SSE_OMP(double*  a, double*  b) {
  #pragma omp parallel for
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

void test(aligned_double *a, aligned_double *b, const char *name) {
  double dtime;
  const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
  const double maxbw = 34.1;
  dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) fp(a,b);
  dtime += omp_get_wtime();
  printf("%s \t time %.2f s, %.1f GB/s, efficency %.1f%%\n", name, dtime, mem/dtime, 100*mem/dtime/maxbw);
}

int main() {
  double *a = (double*)_mm_malloc(sizeof *a * N, 32);
  double *b = (double*)_mm_malloc(sizeof *b * N, 32);

  //b must be initialized to get the correct bandwidth!!!
  memset(a, 1, sizeof *a * N);
  memset(b, 1, sizeof *a * N);

  fp = mul,         test(a,b, "mul        ");
  fp = mul_SSE,     test(a,b, "mul_SSE    ");
  fp = mul_SSE_NT,  test(a,b, "mul_SSE_NT ");
  fp = mul_SSE_OMP, test(a,b, "mul_SSE_OMP");

  _mm_free(a), _mm_free(b);
}

Now the first test

g++ -O2 -fopenmp test.cpp
./a.out
mul              time 1.67 s, 13.1 GB/s, efficiency 38.5%
mul_SSE          time 1.00 s, 21.9 GB/s, efficiency 64.3%
mul_SSE_NT       time 1.05 s, 20.9 GB/s, efficiency 61.4%
mul_SSE_OMP      time 0.74 s, 29.7 GB/s, efficiency 87.0%

So with -O2, which does not vectorize loops, we see that the intrinsic SSE version is much faster than the plain C solution mul. efficiency = bandwidth_measured/max_bandwidth, where the max is 34.1 GB/s for my system.

Second test

g++ -O3 -fopenmp test.cpp
./a.out
mul              time 1.05 s, 20.9 GB/s, efficiency 61.2%
mul_SSE          time 0.99 s, 22.3 GB/s, efficiency 65.3%
mul_SSE_NT       time 1.01 s, 21.7 GB/s, efficiency 63.7%
mul_SSE_OMP      time 0.68 s, 32.5 GB/s, efficiency 95.2%

With -O3 the compiler vectorizes the loop, and the intrinsic function offers essentially no advantage.

Third test

g++ -O3 -fopenmp -funroll-loops test.cpp
./a.out
mul              time 0.85 s, 25.9 GB/s, efficiency 76.1%
mul_SSE          time 0.84 s, 26.2 GB/s, efficiency 76.7%
mul_SSE_NT       time 1.06 s, 20.8 GB/s, efficiency 61.0%
mul_SSE_OMP      time 0.76 s, 29.0 GB/s, efficiency 85.0%

With -funroll-loops GCC unrolls the loop eight times, and we see a significant improvement everywhere except for the non-temporal store solution, and no real advantage for the OpenMP solution.

Before unrolling the loop, the assembly for mul with -O3 is:

    xor     eax, eax
.L2:
    movupd  xmm0, XMMWORD PTR [rsi+rax]
    mulpd   xmm0, XMMWORD PTR [rdi+rax]
    movaps  XMMWORD PTR [rdi+rax], xmm0
    add     rax, 16
    cmp     rax, 8000000
    jne     .L2
    rep ret

With -O3 -funroll-loops the assembly for mul is:

   xor     eax, eax
.L2:
    movupd  xmm0, XMMWORD PTR [rsi+rax]
    movupd  xmm1, XMMWORD PTR [rsi+16+rax]
    mulpd   xmm0, XMMWORD PTR [rdi+rax]
    movupd  xmm2, XMMWORD PTR [rsi+32+rax]
    mulpd   xmm1, XMMWORD PTR [rdi+16+rax]
    movupd  xmm3, XMMWORD PTR [rsi+48+rax]
    mulpd   xmm2, XMMWORD PTR [rdi+32+rax]
    movupd  xmm4, XMMWORD PTR [rsi+64+rax]
    mulpd   xmm3, XMMWORD PTR [rdi+48+rax]
    movupd  xmm5, XMMWORD PTR [rsi+80+rax]
    mulpd   xmm4, XMMWORD PTR [rdi+64+rax]
    movupd  xmm6, XMMWORD PTR [rsi+96+rax]
    mulpd   xmm5, XMMWORD PTR [rdi+80+rax]
    movupd  xmm7, XMMWORD PTR [rsi+112+rax]
    mulpd   xmm6, XMMWORD PTR [rdi+96+rax]
    movaps  XMMWORD PTR [rdi+rax], xmm0
    mulpd   xmm7, XMMWORD PTR [rdi+112+rax]
    movaps  XMMWORD PTR [rdi+16+rax], xmm1
    movaps  XMMWORD PTR [rdi+32+rax], xmm2
    movaps  XMMWORD PTR [rdi+48+rax], xmm3
    movaps  XMMWORD PTR [rdi+64+rax], xmm4
    movaps  XMMWORD PTR [rdi+80+rax], xmm5
    movaps  XMMWORD PTR [rdi+96+rax], xmm6
    movaps  XMMWORD PTR [rdi+112+rax], xmm7
    sub     rax, -128
    cmp     rax, 8000000
    jne     .L2
    rep ret

Fourth test

g++ -O3 -fopenmp -mavx test.cpp
./a.out
mul              time 0.87 s, 25.3 GB/s, efficiency 74.3%
mul_SSE          time 0.88 s, 24.9 GB/s, efficiency 73.0%
mul_SSE_NT       time 1.07 s, 20.6 GB/s, efficiency 60.5%
mul_SSE_OMP      time 0.76 s, 29.0 GB/s, efficiency 85.2%

Now the non-intrinsic function is the fastest (excluding the OpenMP version).

So there is no reason to use intrinsics or inline assembly in this case, because we can get the best performance with appropriate compiler options (e.g. -O3, -funroll-loops, -mavx).

Test system: Ubuntu 16.10, Skylake (i7-6700HQ @ 2.60GHz), 32GB RAM. Maximum memory bandwidth (34.1 GB/s): https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz


Here is another solution worth considering. The cmp instruction is not necessary if we count from -N up to zero and access the arrays as N+i. GCC should have fixed this a long time ago. It eliminates one instruction (though due to macro-op fusion the cmp and the jump often count as a single micro-op).

void mul_SSE_v2(double*  a, double*  b) {
  for (ptrdiff_t i = -N; i<0; i+=2)
    _mm_store_pd(&a[N + i], _mm_mul_pd(_mm_load_pd(&a[N + i]),_mm_load_pd(&b[N + i])));
}

Assembly with -O3:

mul_SSE_v2(double*, double*):
    mov     rax, -1000000
.L9:
    movapd  xmm0, XMMWORD PTR [rdi+8000000+rax*8]
    mulpd   xmm0, XMMWORD PTR [rsi+8000000+rax*8]
    movaps  XMMWORD PTR [rdi+8000000+rax*8], xmm0
    add     rax, 2
    jne     .L9
    rep ret

This optimization will only possibly be helpful when the arrays fit in, e.g., the L1 cache, i.e. when not reading from main memory. At 1,000,000 doubles each array is 8 MB, far larger than a typical L1 cache, so it should not matter here.


I finally found a way to get the plain C solution to not generate the cmp instruction.

void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
  for (int i = -N; i<0; i++) a[i] *= b[i];
}

And then call the function from a separate object file, like this: mul_v2(&a[N], &b[N]). So this is perhaps the best solution. However, if you call the function from the same object file (translation unit) as the one it's defined in, GCC generates the cmp instruction again.

Also,

void mul_v3(aligned_double* __restrict a, aligned_double* __restrict b) {
  for (int i = -N; i<0; i++) a[N+i] *= b[N+i];
}

still generates the cmp instruction and produces the same assembly as the mul function.


The function mul_SSE_NT is silly. It uses non-temporal stores, which are only useful when memory is written but not read back; since the function reads from and writes to the same addresses, non-temporal stores are not just useless here, they give inferior results.
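
For contrast, a minimal sketch of the case where non-temporal stores can pay off: writing the product to a third, write-only array (mul_SSE_NT_dst is a hypothetical variant; the pointers are assumed 16-byte aligned):

#include <x86intrin.h>

// c is only written, never read back, so streaming stores bypass the
// cache and avoid polluting it with lines that will not be touched again.
void mul_SSE_NT_dst(double *a, double *b, double *c, int n) {
  for (int i = 0; i < n/2; i++)
    _mm_stream_pd(&c[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]), _mm_load_pd(&b[2*i])));
  _mm_sfence(); // make the streamed stores visible before later stores
}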


Previous versions of this answer were getting the wrong bandwidth. The reason was arrays that were not initialized.

I want to add another point of view on the problem. SIMD instructions give a big performance boost if there are no memory-bandwidth restrictions. But in the current example there are too many memory load and store operations and too few CPU calculations, so the CPU can keep up with the incoming data without SIMD. If you use data of another type (32-bit float, for example) or a more complex algorithm, memory throughput won't restrict CPU performance, and using SIMD will give more of an advantage.
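
To illustrate that second point, here is a hypothetical compute-heavy kernel (not from the question): Horner evaluation of a degree-3 polynomial performs three multiplies and three adds per 16 bytes loaded, so arithmetic rather than memory traffic dominates and the packed version can approach its theoretical speedup. The coefficients are arbitrary, and the pointer is assumed 16-byte aligned:

#include <x86intrin.h>

// Evaluate c3*x^3 + c2*x^2 + c1*x + c0 in place, two doubles at a time.
void poly_SSE(double *a, int n) {
  const __m128d c3 = _mm_set1_pd(1.1), c2 = _mm_set1_pd(2.2),
                c1 = _mm_set1_pd(3.3), c0 = _mm_set1_pd(4.4);
  for (int i = 0; i < n/2; i++) {
    __m128d x = _mm_load_pd(&a[2*i]);
    __m128d p = _mm_add_pd(_mm_mul_pd(c3, x), c2); // c3*x + c2
    p = _mm_add_pd(_mm_mul_pd(p, x), c1);          // (c3*x + c2)*x + c1
    p = _mm_add_pd(_mm_mul_pd(p, x), c0);          // (...)*x + c0
    _mm_store_pd(&a[2*i], p);
  }
}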
