
no speedup using openmp + SIMD

I am new to OpenMP and am now trying to use OpenMP + SIMD intrinsics to speed up my program, but the result is far from what I expected.

In order to simplify the case without losing much essential information, I wrote a simpler toy example:

#include <omp.h>
#include <stdlib.h>
#include <cstdint> // for int64_t
#include <iostream>
#include <vector>
#include <sys/time.h>

#include "immintrin.h" // for SIMD intrinsics

int main() {
    int64_t size = 160000000;
    std::vector<int> src(size);

    // generating random src data
    for (int i = 0; i < size; ++i)
        src[i] = (rand() / (float)RAND_MAX) * size;

    // to store the final results, so size is the same as src
    std::vector<int> dst(size);

    // get pointers for vector load and store
    int * src_ptr = src.data();
    int * dst_ptr = dst.data();

    __m256i vec_src;
    __m256i vec_op = _mm256_set1_epi32(2);
    __m256i vec_dst;

    omp_set_num_threads(4); // you can change thread count here

    // only measure the parallel part
    struct timeval one, two;
    double get_time;
    gettimeofday (&one, NULL);

    #pragma omp parallel for private(vec_src, vec_dst) firstprivate(vec_op) // firstprivate: copy the initialized vec_op into each thread
    for (int64_t i = 0; i < size; i += 8) {
        // load needed data
        vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));

        // computation part
        vec_dst = _mm256_add_epi32(vec_src, vec_op);
        vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
        vec_dst = _mm256_slli_epi32(vec_dst, 1);
        vec_dst = _mm256_add_epi32(vec_dst, vec_src);
        vec_dst = _mm256_sub_epi32(vec_dst, vec_src);

        // store results
        _mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_dst);
    }

    gettimeofday(&two, NULL);
    double oneD = one.tv_sec + (double)one.tv_usec * .000001;
    double twoD = two.tv_sec + (double)two.tv_usec * .000001;
    get_time = 1000 * (twoD - oneD);
    std::cout << "took time: " << get_time << std::endl;

    // output something in case the computation is optimized out
    int64_t i = (int64_t)((rand() / (float)RAND_MAX) * size);
    std::cout << i << ": " << dst[i] << std::endl;

    return 0;
}

It is compiled using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp test.cpp -o test and the elapsed time of the parallel part is measured. The results, in milliseconds, are as follows (the median of 5 runs is picked for each):

1 thread: 92.519

2 threads: 89.045

4 threads: 90.361

The computation seems embarrassingly parallel, as different threads can load the data they need simultaneously given different indices, and the same holds for writing the results, so why is there no speedup?

More information:

  1. I checked the assembly code using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp -S test.cpp and found that vectorized instructions are generated;

  2. To check whether it is memory-bound, I commented out the computation part in the loop (the stripped-down loop is sketched below), and the measured time decreased to around 60 ms, but it does not change much if I change the thread count from 1 -> 2 -> 4.
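For reference, a minimal sketch of that stripped-down check (my reconstruction, reusing size, src_ptr, dst_ptr, and vec_src from the code above), leaving only the load and store:

#pragma omp parallel for private(vec_src)
for (int64_t i = 0; i < size; i += 8) {
    // load and immediately store back, no arithmetic in between
    vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
    _mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_src);
}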

Any advice or clue is welcome.

EDIT-1:

Thanks to @JerryCoffin for pointing out the possible cause, I did a Memory Access Analysis using VTune. Here are the results:

1-thread: Memory Bound: 6.5%, L1 Bound: 0.134, L3 Latency: 0.039

2-threads: Memory Bound: 18.0%, L1 Bound: 0.115, L3 Latency: 0.015

4-threads: Memory Bound: 21.6%, L1 Bound: 0.213, L3 Latency: 0.003

It is an Intel i7-4770 processor with a max bandwidth of 25.6 GB/s (23 GB/s as measured by VTune). The Memory Bound metric does increase, but I am still not sure if that is the cause. Any advice? (A rough sanity check of the numbers is sketched below.)
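As a back-of-the-envelope check (my own arithmetic, assuming the timings above are in milliseconds and that each element costs three 4-byte transfers: reading src, reading the dst cache line before writing it, and writing dst):

#include <cstdio>

int main() {
    const double bytes = 3.0 * 160000000 * 4;          // ~1.92 GB moved per pass
    const double seconds = 0.0925;                     // single-threaded time from above
    std::printf("%.1f GB/s\n", bytes / seconds / 1e9); // prints ~20.8 GB/s
    return 0;
}

Under those assumptions, a single thread already moves data at ~20.8 GB/s, close to the 23 GB/s ceiling measured by VTune, which would leave little headroom for additional threads.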

EDIT-2 (just trying to give thorough information, so the appended material may be long, but hopefully not tedious):

Thanks for the suggestions from @PaulR and @bazza. I tried 3 ways for comparison. One thing to note is that the processor has 4 cores and 8 hardware threads. Here are the results:

(1) just initialize dst as all zeros in advance: 1 thread: 91.922; 2 threads: 93.170; 4 threads: 93.868 --- seems not effective;

(2) without (1), put the parallel part in an outer loop over 100 iterations, and measure the time of the 100 iterations: 1 thread: 9109.49; 2 threads: 4951.20; 4 threads: 2511.01; 8 threads: 2861.75 --- quite effective except for 8 threads;

(3) based on (2), put one more iteration before the 100 iterations, and measure the time of the 100 iterations: 1 thread: 9078.02; 2 threads: 4956.66; 4 threads: 2516.93; 8 threads: 2088.88 --- similar to (2) but more effective for 8 threads.

It seems more iterations can expose the advantages of OpenMP + SIMD, but the computation / memory access ratio is unchanged regardless of the loop count, and locality does not seem to be the reason either, since src and dst are too large to stay in any cache, so there is no data reuse between consecutive iterations.

Any advice?

EDIT-3:

In case this is misleading, one thing needs to be clarified: in (2) and (3), the OpenMP directive is outside the added outer loop:

#pragma omp parallel for private(vec_src, vec_dst) firstprivate(vec_op)
for (int k = 0; k < 100; ++k) {
    for (int64_t i = 0; i < size; i += 8) {
        ......
    }
}

i.e. the outer loop is parallelized across threads, while the inner loop is still processed serially by each thread. So the effective speedup in (2) and (3) might be achieved by enhanced locality among threads.

I did another experiment where the OpenMP directive is put inside the outer loop:

for (int k = 0; k < 100; ++k) {
    #pragma omp parallel for private(vec_src, vec_dst) firstprivate(vec_op)
    for (int64_t i = 0; i < size; i += 8) {
        ......
    }
}

and the speedup is still not good: 1 thread: 9074.18; 2 threads: 8809.36; 4 threads: 8936.89; 8 threads: 9098.83.

Problem still exists. :(

EDIT-4:

If I replace the vectorized part with scalar operations like this (the same calculations, but in a scalar way):

#pragma omp parallel for
for (int64_t i = 0; i < size; i++) { // not i += 8
    int query = src[i];
    int res = src[i] + 2;
    res = res * query;
    res = res << 1;
    res = res + query;
    res = res - query;
    dst[i] = res;
}

The speedup is: 1 thread: 92.065; 2 threads: 89.432; 4 threads: 88.864. May I come to the conclusion that this seemingly embarrassingly parallel problem is actually memory bound (the bottleneck being load / store operations)? If so, why can't the load / store operations be parallelized well?

May I come to the conclusion that this seemingly embarrassingly parallel problem is actually memory bound (the bottleneck being load / store operations)? If so, why can't the load / store operations be parallelized well?

Yes, this problem is embarrassingly parallel in the sense that it is easy to parallelize due to the lack of dependencies. That doesn't imply that it will scale perfectly, though. You can still have a bad initialization-overhead-to-work ratio, or shared resources limiting your speedup. (A small way to gauge the former is sketched below.)
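For instance, the fork/join overhead of a parallel region can be estimated with a tiny loop like this (an illustrative micro-benchmark of my own; the absolute numbers depend on the OpenMP runtime and machine):

#include <omp.h>
#include <cstdio>

int main() {
    const int reps = 1000;
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r) {
        #pragma omp parallel
        { } // empty region: pure fork/join cost
    }
    double t1 = omp_get_wtime();
    std::printf("avg fork/join overhead: %.1f us\n", (t1 - t0) * 1e6 / reps);
    return 0;
}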

In your case, you are indeed limited by memory bandwidth. A practical consideration first: when compiled with icpc (16.0.3 or 17.0.1), the "scalar" version yields better code when size is made constexpr. This is not due to the fact that it optimizes away these two redundant lines:

res = res + query;
res = res - query;

It does, but that makes no difference. Mainly, the compiler uses exactly the same instructions that you do with the intrinsics, except for the store. For the store, it uses vmovntdq instead of vmovdqu, making use of sophisticated knowledge about the program, memory, and the architecture. Not only does vmovntdq require aligned memory and can therefore be more efficient; it also gives the CPU a non-temporal hint, preventing this data from being cached during the write to memory. This improves performance, because writing to the cache requires loading the remainder of the cache line from memory. So while your initial SIMD version requires three memory operations (reading the source, reading the destination cache line, and writing the destination), the compiler version with the non-temporal store requires only two. In fact, on my i7-4770 system, the compiler-generated version reduces the runtime at 2 threads from ~85.8 ms to 58.0 ms, an almost perfect 1.5x speedup. The lesson here is to trust your compiler unless you know the architecture and instruction set extremely well.
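If you do want the non-temporal store with intrinsics, the corresponding one is _mm256_stream_si256, which requires 32-byte-aligned addresses. A minimal sketch (my own variant of the question's kernel, with the redundant + src / - src steps dropped, since the compiler removes them anyway):

#include <omp.h>
#include <cstdint>
#include <immintrin.h>

int main() {
    const int64_t size = 160000000;
    // _mm256_stream_si256 needs 32-byte alignment, hence _mm_malloc
    int *src = static_cast<int *>(_mm_malloc(size * sizeof(int), 32));
    int *dst = static_cast<int *>(_mm_malloc(size * sizeof(int), 32));
    for (int64_t i = 0; i < size; ++i)
        src[i] = (int)i; // any data will do for a bandwidth experiment

    const __m256i vec_op = _mm256_set1_epi32(2);
    #pragma omp parallel for
    for (int64_t i = 0; i < size; i += 8) {
        // aligned load is safe because of the aligned allocation
        __m256i v = _mm256_load_si256((__m256i const *)(src + i));
        __m256i r = _mm256_add_epi32(v, vec_op);
        r = _mm256_mullo_epi32(r, v);
        r = _mm256_slli_epi32(r, 1);
        // vmovntdq: write around the cache, skipping the
        // read-for-ownership of the destination cache line
        _mm256_stream_si256((__m256i *)(dst + i), r);
    }
    _mm_sfence(); // order the non-temporal stores before later reads

    _mm_free(src);
    _mm_free(dst);
    return 0;
}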

Considering peak performance here, 58 ms for transferring 2 * 160000000 * 4 bytes corresponds to 22.07 GB/s (summing reads and writes), which is about the same as your VTune results. (Funnily enough, 85.8 ms corresponds to about the same bandwidth for two reads and one write.) There isn't much more direct room for improvement.

To further improve performance, you would have to do something about the operation / byte ratio of your code. Remember that your processor can perform 217.6 GFLOP/s (I guess either the same or twice that for int ops), but can only read & write 3.2 G int/s. That gives you an idea of how many operations you need to perform in order not to be limited by memory. So if you can, work on the data in blocks so that you can reuse data in caches. (One illustrative restructuring is sketched below.)
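To illustrate what a higher operation / byte ratio can look like (a hypothetical restructuring of experiment (2), reusing size, src_ptr, dst_ptr, and vec_op from the question; note the chained arithmetic computes something different from 100 identical passes, the point is only the traffic pattern):

#pragma omp parallel for
for (int64_t i = 0; i < size; i += 8) {
    __m256i v = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
    const __m256i s = v;
    // ~100x more arithmetic per byte of memory traffic:
    // the data stays in registers between rounds
    for (int k = 0; k < 100; ++k) {
        v = _mm256_add_epi32(v, vec_op);
        v = _mm256_mullo_epi32(v, s);
        v = _mm256_slli_epi32(v, 1);
    }
    _mm256_storeu_si256((__m256i *)(dst_ptr + i), v);
}

Instead of streaming the 640 MB arrays through memory 100 times, each element is loaded once, processed 100 times in registers, and stored once.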

I cannot reproduce your results for (2) and (3). When I loop around the inner loop, the scaling behaves the same. The results look fishy, particularly in the light of the results otherwise being so consistent with peak performance. Generally, I recommend doing the measurement inside of the parallel region and leveraging omp_get_wtime, like this:

  double one, two;
#pragma omp parallel 
  {
    __m256i vec_src;
    __m256i vec_op = _mm256_set1_epi32(2);   
    __m256i vec_dst;

#pragma omp master
    one = omp_get_wtime();
#pragma omp barrier
    for (int kk = 0; kk < 100; kk++) {
#pragma omp for
        for (int64_t i = 0; i < size; i += 8) {
            ...
        }
    }
#pragma omp master
    {
      two = omp_get_wtime();
      std::cout << "took time: " << (two-one) * 1000 << std::endl;
    }
  }
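Taking the timestamps inside the region, after a barrier, keeps the one-time cost of creating the thread team out of the measured interval, which matters precisely because of the initialization overhead mentioned above.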

A final remark: desktop processors and server processors have very different characteristics regarding memory performance. On contemporary server processors, you need many more active threads to saturate the memory bandwidth, while on desktop processors a single core can often almost saturate it.

Edit: one more thought about VTune not classifying the program as memory-bound. This may be caused by the short computation time compared to the initialization. Try to see what VTune says about the code when it runs in a loop.
