
How to optimize a simple loop?

The loop is simple:

void loop(int n, double* a, double const* b)
{
#pragma ivdep
    for (int i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}

I am using the Intel C++ compiler, and currently I use #pragma ivdep for optimization. Is there any way to make it perform better, such as using multicore and vectorization together, or other techniques?

Assuming the data pointed to by a can't overlap the data pointed to by b, that fact is the most important information you can give the compiler to let it optimize the code.

In older ICC versions, restrict was the only clean way to provide that key information to the compiler. In newer versions there are a few cleaner ways to give a much stronger guarantee than ivdep gives (in fact, ivdep is a weaker promise to the optimizer than it appears and generally doesn't have the intended effect).
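For example, a minimal sketch of the restrict-qualified version (assuming a and b really never alias; the C99 spelling is shown, and ICC also accepts __restrict in C++):

void loop(int n, double* restrict a, double const* restrict b)
{
    /* restrict promises the compiler that a and b do not overlap,
       which is a stronger, cleaner guarantee than #pragma ivdep */
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}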

But if n is large, the whole thing will be dominated by cache misses, so no local optimization can help.

  1. This loop is absolutely vectorizable by the compiler. But make sure the loop was actually vectorized (using the compiler's -qopt-report5, the assembly output, Intel (Vectorization) Advisor, or whatever other technique). One more, overkill, way to do that is to create a performance baseline using the -no-vec option (which disables both ivdep-driven and auto-vectorization) and then compare execution time against it. This is not a good way to check for the presence of vectorization, but it's useful for the general performance analysis needed for the next bullets.

If the loop hasn't actually been vectorized, make sure you push the compiler to auto-vectorize it. To push the compiler, see the next bullet. Note that the next bullet could be useful even if the loop was successfully auto-vectorized.

  2. To push the compiler to vectorize it, use: (a) the restrict keyword to "disambiguate" the a and b pointers (someone has already suggested it to you); (b) #pragma omp simd (which has the extra bonus of being more portable and much more flexible than ivdep, but also the drawback of being unsupported by old compilers before Intel compiler version 14, and of being more "dangerous" for other kinds of loops). To re-emphasize: this bullet may seem to do the same thing as ivdep, but depending on the circumstances it can be the better and more powerful option; see the first sketch after this list.

  3. The given loop has fine-grained iterations (too little computation per iteration) and overall is not purely compute-bound (the effort/cycles the CPU spends loading/storing data from/to cache/memory is comparable to, if not bigger than, the effort/cycles spent performing the multiplication). Unrolling is often a good way to slightly mitigate such disadvantages. But I would recommend explicitly asking the compiler to unroll it, using #pragma unroll. In fact, for certain compiler versions the unrolling will happen automatically. Again, you can check whether the compiler did it by using -qopt-report5, the loop assembly, or Intel (Vectorization) Advisor.

  4. In the given loop you deal with a "streaming" access pattern. I.e., you are contiguously loading/storing data from/to memory (and the cache subsystem will not help much for big values of n). So, depending on the target hardware, the use of multi-threading (atop SIMD), etc., your loop will likely become memory-bandwidth-bound in the end. Once you are memory-bandwidth-bound, you can use techniques like loop blocking, non-temporal stores, and aggressive prefetching. All of these techniques are worth a separate article, although for prefetching/NT-stores you have some pragmas in the Intel Compiler to play with.

  5. If n is huge, and you are already prepared for memory bandwidth troubles, you can use something like #pragma omp parallel for simd, which will simultaneously thread-parallelize and vectorize the loop. However, the quality of this feature has become decent only in very fresh compiler versions AFAIK, so maybe you'd prefer to split n semi-manually, i.e. n = n1 × n2 × n3, where n1 is the number of iterations to distribute among threads, n2 is for cache blocking, and n3 is for vectorization. Rewrite the given loop as a nest of 3 loops, where the outer loop has n1 iterations (and #pragma omp parallel for is applied), the next-level loop has n2 iterations, and n3 is innermost (where #pragma omp simd is applied); see the second sketch after this list.
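As a sketch of bullet 2 (my assumption: compilation with icc -qopenmp-simd or -qopenmp), restrict plus #pragma omp simd could look like:

void loop(int n, double* restrict a, double const* restrict b)
{
    /* omp simd is a stronger, more portable promise than ivdep:
       it asserts the iterations are safe to execute as SIMD lanes */
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}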
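And a sketch of the semi-manual n = n1 × n2 × n3 split from bullet 5. CHUNK and BLOCK are hypothetical sizes to tune, and for brevity the sketch assumes n is a multiple of CHUNK and CHUNK a multiple of BLOCK:

void loop_nest(int n, double* restrict a, double const* restrict b)
{
    enum { CHUNK = 1 << 20, BLOCK = 1 << 12 };  /* placeholder sizes */
    #pragma omp parallel for          /* n1: one contiguous chunk per thread */
    for (int t = 0; t < n; t += CHUNK) {
        for (int c = t; c < t + CHUNK; c += BLOCK) {  /* n2: cache blocks */
            #pragma omp simd                          /* n3: vector lanes */
            for (int i = c; i < c + BLOCK; ++i)
                a[i] *= b[i];
        }
    }
}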

Some up-to-date links with syntax examples and more info:

Note1: I apologize that I don't provide various code snippets here. There are at least 2 justifiable reasons for not providing them: 1. My 5 bullets are applicable to very many kernels, not just yours. 2. On the other hand, the specific combination of pragmas/manual rewriting techniques and the corresponding performance results will vary depending on the target platform, ISA and compiler version.

Note2: A last comment regarding your GPU question. Think of your loop vs. simple industry benchmarks like LINPACK or STREAM. In fact, your loop could end up being very similar to some of them. Now think of the characteristics of x86 CPUs, and especially the Intel Xeon Phi platform, for LINPACK/STREAM. They are very good indeed, and will become even better with High Bandwidth Memory platforms (like Xeon Phi 2nd gen). So theoretically there is no single reason to think that your given loop is not well mapped to at least some variants of x86 hardware (note that I didn't say a similar thing for an arbitrary kernel in the universe).

I assume that n is large. You can distribute the workload over k CPUs by starting k threads and providing each with n/k elements. Use big chunks of consecutive data for each thread; don't do fine-grained interleaving. Try to align the chunks with cache lines; see the sketch after the next paragraph.

If you plan to scale to more than one NUMA node, consider explicitly copying the chunks of workload to the node the thread runs on, and copying the results back. In this case it might not really help, because the workload for each step is very simple. You'll have to run tests for that.
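A minimal sketch of that chunking idea using plain OpenMP threads (my choice of API; the answer doesn't prescribe one), giving each thread one big consecutive block:

#include <omp.h>

void loop_mt(int n, double* a, double const* b)
{
    #pragma omp parallel
    {
        int k   = omp_get_num_threads();
        int tid = omp_get_thread_num();
        /* contiguous [lo, hi) range per thread, no interleaving;
           rounding lo/hi to multiples of 8 doubles (one cache line)
           is left out for brevity */
        int lo = (int)((long long)n * tid / k);
        int hi = (int)((long long)n * (tid + 1) / k);
        for (int i = lo; i < hi; ++i)
            a[i] *= b[i];
    }
}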

Manually unrolling the loop is a simple way to optimize your code, and the following is my code. The original loop costs 618.48 ms, while loop2 costs 381.10 ms on my PC; the compiler is GCC with option '-O2'. I don't have the Intel ICC to verify the code, but I think the optimization principles are the same.

Similarly, I did some experiments comparing the execution time of two programs that XOR two blocks of memory: one program is vectorized with the help of SIMD instructions, while the other is manually loop-unrolled. If you are interested, see here.

P.S. Of course loop2 only works when n is even; see the variant sketched after the code for odd n.

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define LEN (512*1024)
#define times  1000

void loop(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}

void loop2(int n, double* a, double const* b){
    int i;
    /* unrolled by 2; braces are required here, otherwise the second
       statement would run once, out of bounds, after the loop */
    for(i = 0; i < n; i = i + 2, a = a + 2, b = b + 2){
        *a *= *b;
        *(a+1) *= *(b+1);
    }
}


int main(void){
    double *la, *lb;
    struct timeval begin, end;
    int i;

    la = (double *)malloc(LEN*sizeof(double));
    lb = (double *)malloc(LEN*sizeof(double));
    /* initialize the buffers so we don't time multiplications on
       uninitialized (possibly denormal) values */
    for(i = 0; i < LEN; ++i){
        la[i] = 1.0;
        lb[i] = 1.0;
    }
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
            +(end.tv_usec-begin.tv_usec)/1000.0);

    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop2(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
            +(end.tv_usec-begin.tv_usec)/1000.0);

    free(la);
    free(lb);
    return 0;
}
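If n can be odd, a variant of loop2 with a scalar remainder step might look like this (same unroll-by-2 idea, just guarding the last element; loop2_any is my hypothetical name):

void loop2_any(int n, double* a, double const* b){
    int i;
    for(i = 0; i + 1 < n; i = i + 2){  /* unrolled by 2 */
        a[i]   *= b[i];
        a[i+1] *= b[i+1];
    }
    if(i < n)                          /* leftover element when n is odd */
        a[i] *= b[i];
}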
