
Optimizing a slow loop

The code looks like this and the inner loop takes a huge amount of time:

#define _table_derive                  ((double*)(Buffer_temp + offset))
#define Table_derive(m,nbol,pos)        _table_derive[(m) + 5*((pos) + _interval_derive_dIdQ * (nbol))]
char *Buffer_temp=malloc(...);

for (n_bol=0; n_bol<1400; n_bol++) {  // long loop here
    [lots of code here, hundreds of lines with computations on doubles, other loops, etc]

    double ddI=0, ddQ=0;

    // This is the original code
    for(k=0; k< 100; k++ ) {
            ddI += Table_derive(2,n_bol,k);
            ddQ += Table_derive(3,n_bol,k);
    }
    ddI /= _interval_derive_dIdQ;
    ddQ /= _interval_derive_dIdQ;
    [more code here]
}

oprofile tells me that most of the runtime is spent here (2nd column is % of time):

129304  7.6913 :for(k=0; k< 100; k++) {
275831 16.4070 :ddI += Table_derive(2,n_bol,k);
764965 45.5018 :ddQ += Table_derive(3,n_bol,k);

My first question is: can I rely on oprofile to indicate the proper place where the code is slow? (I tried with -Og and -Ofast and it's basically the same.)

My second question is: how come this very simple loop is slower than sqrt, atan2, and the many hundred lines of computations that come before? I know I'm not showing all the code, but there's lots of it and it doesn't make sense to me.

I've tried various optimizer tricks to either vectorize (doesn't work) or unroll (works, but for little gain), for instance:

    typedef double aligned_double __attribute__((aligned(8)));
    typedef const aligned_double* SSE_PTR;
    SSE_PTR TD=(SSE_PTR)&Table_derive(2,n_bol,0);   // We KNOW the alignment is correct because offset is a multiple of 8

    for(k=0; k< 100; k++, TD+=5) {
        #pragma Loop_Optimize Unroll No_Vector
        ddI += TD[0];
        ddQ += TD[1];
    }

I've checked the output of the optimizer with "-Ofast -g -march=native -fopt-info-all=missed.info -funroll-loops"; in this case I get "loop unrolled 9 times", but if I try to vectorize, I get (in short): "can't force alignment of ref", "vector alignment may not be reachable", "Vectorizing an unaligned access", "Unknown alignment for access: *(prephitmp_3784 + ((sizetype) _1328 + (long unsigned int) (n_bol_1173 * 500) * 2) * 4)"

Any way to speed this up?

ADDENDUM: Thanks all for the comments, I'll try to answer here:

  • yes, I know the code is ugly (it's not mine), and you haven't seen the actual original (that's a huge simplification)
  • I'm stuck with this array as the C code is in a library, and the large array, once processed and modified by the C, gets passed on to the caller (either IDL, Python or C).
  • I know it would be better to use some structs instead of casting char* to a complicated multidimensional double*, but see above. Structs may not have been part of the C spec when this prog was first written (just kidding... maybe)
  • I know that for the vectorizer it's better to have structs of arrays than arrays of structs, but, sigh... see above.
  • there's an actual outer loop (in the calling program), so the total size of this monolithic array is around 2Gb
  • as is, it takes about 15 minutes to run with no optimization, and one minute after I rewrote some code (faster atan2, some manual aligns inside the array...) and used -Ofast and -march=native
  • Due to constraint changes in the hardware, I'm trying to go faster to keep up with the dataflow.
  • I tried with Clang and the gains were slight (a few seconds), but I do not see an option to get an optimization report such as -fopt-info. Do I have to look at the assembly as the only option to know what's going on?
  • the system is a beastly 64-core with 500Gb of RAM, but I haven't been able to insert any OpenMP pragmas to parallelize the above code (I've tried): it reads a file, decompresses it entirely in memory (2Gb), analyses it in sequence (things like '+='), and spits out some results to the calling IDL/Python. All on a single core (but the other cores are quite busy with the actual acquisition and post processing). :(
  • @Useless, thanks for the excellent suggestion: removing ddQ += ... seems to transfer the % of time to the previous line: 376280 39.4835:ddI+=...
  • which brings us to even better: removing both (hence the entire loop) saves... nothing at all! So I guess, as Peter said, I can't trust the profiler. If I profile the loopless prog, I get timings more evenly spread out (previously only 3 lines above 1s, now about 10, all nonsensical, like simple variable assigns).

I guess that inner loop was a red herring from the start; I'll restart my optimization using manual timings. Thanks.

My first question is: can I rely on oprofile to indicate the proper place where the code is slow

Not precisely. As I understand it, cycles often get charged to the instruction that's waiting for inputs (or some other execution resource), not the instruction that's slow to produce inputs or free up whatever other execution resource.

However, in your oprofile output, it's probable that it's actually that final loop. Are there other inner loops inside this outer loop?

Did you profile cache misses? There are counters for many interesting things besides cycles.

Also note that to really understand the performance, you need to look at profile annotations on the asm, not the C. e.g. it's weird that one add accounts for more of the time than the other, but that's probably just an issue of mapping insns to source lines.


re: perf results from commenting out the loop:

So the program doesn't run any faster at all without that inner loop? If the outer loop already touched that memory, maybe you're just bottlenecked on cache misses, and the inner loop was just touching that memory again? Try perf record -e L1-dcache-load-misses ./a.out then perf report. Or the oprofile equivalent.

Maybe the inner-loop uops were stuck waiting to issue until slow stuff in the outer loop retired. The ReOrder Buffer (ROB) size in modern Intel CPUs is around 200 uops, and most insns decode to a single uop, so the out-of-order window is around 200 instructions.

Commenting out that inner loop also means that any loop-carried dependency chains in the outer loop don't have time to complete while the inner loop is running. Removing that inner loop could produce a qualitative change in the bottleneck for the outer loop, from throughput to latency.


re: 15x faster with -Ofast -march=native. Ok, that's good. Un-optimized code is horrible, and shouldn't be considered any kind of "baseline" or anything for performance. If you want to compare with something, compare with -O2 (doesn't include auto-vectorization, -ffast-math, or -march=native).

Try using -fprofile-generate / -fprofile-use. profile-use includes -funroll-loops, so I assume that option works best when there is profiling data available.

re: auto-parallelization:

You have to enable that specifically, either with OpenMP pragmas or with gcc options like -floop-parallelize-all -ftree-parallelize-loops=4. Auto-parallelization may not be possible if there are non-trivial loop-carried dependencies. That wiki page is old, too, and might not reflect the state of the art in auto-parallelization. I think OpenMP hints about which loops to parallelize are a saner way to go than having the compiler guess, esp. without -fprofile-use.


I tried with Clang and the gains were slight (a few seconds), but I do not see an option to get an optimization report such as -fopt-info. Do I have to look at the assembly as the only option to know what's going on?

The clang manual says you can use clang -Rpass=inline for a report on inlining. The llvm docs say that the name of the vectorization pass is loop-vectorize, so you can use -Rpass-missed=loop-vectorize, or -Rpass-analysis=loop-vectorize to tell you which statement caused vectorization to fail.

Looking at the asm is the only way to know whether it auto-vectorized badly or not, but to really judge the compiler's work you have to know how to write efficient asm yourself (so you know approximately what it could have done). See http://agner.org/optimize/ , and other links in the tag wiki.

I didn't try putting your code on http://gcc.godbolt.org/ to try it with different compilers, but you could post a link if your example makes asm that's representative of what you see from the full source.


Auto-vectorization

for(k=0; k< 100; k++ ) {
        ddI += Table_derive(2,n_bol,k);
        ddQ += Table_derive(3,n_bol,k);
}

This should auto-vectorize, since 2 and 3 are consecutive elements. You would get better cache locality (for this part) if you split up the table into multiple tables, e.g. keep elements 2 and 3 of each group of 5 in one array. Group other elements that are used together into tables. (If there's overlap, e.g. another loop needs elements 1 and 3, then maybe split up the one that can't auto-vectorize anyway?)

re: question update: You don't need a struct-of-arrays for this to auto-vectorize with SSE. A 16B vector holds exactly two doubles, so the compiler can accumulate a vector of [ ddI ddQ ] with addpd. With AVX 256b vectors, it would have to do a vmovupd / vinsertf128 to get that pair of doubles from adjacent structs, instead of a single 256b load, but that's not a big deal. Memory locality is an issue, though; you're only using 2 out of every 5 doubles in the cache lines you touch.


It should probably auto-vectorize even without -ffast-math, as long as you're targeting a CPU with double-precision vectors (e.g. x86-64, or 32bit with -msse2).

gcc likes to make big prologues for potentially-unaligned data, using scalar code until it reaches an aligned address. This leads to bloated code, esp. with 256b vectors and small elements. It shouldn't be too bad with double, though. Still, give clang 3.7 or clang 3.8 a try. clang auto-vectorizes potentially-unaligned accesses with unaligned loads, which have no extra cost when the data is aligned at runtime. (gcc optimizes for the hopefully-rare case where the data isn't aligned, because unaligned load/store instructions were slower on old CPUs (e.g. Intel pre-Nehalem) even when used on aligned data.)


Your char array may be defeating the auto-vectorizer, if it can't prove that each double is even 8B-aligned. Like @JohnBollinger commented, that's really ugly. If you have an array of structs of 5 doubles, declare it that way!

How to write it as an array of structs:

Keep the "manual" multidimensional indexing, but make the base 1D array an array of double, or better, of a struct type, so the compiler will assume every double is 8B-aligned.

Your original version also referenced the global Buffer_temp for every access to the array. (Or was it a local?) Any store that might alias it would require re-loading the base pointer. (C's aliasing rules allow char* to alias anything, but I think your cast to a double* before dereferencing saves you from that. You're not storing to the array inside the inner loop anyway, but I assume you are in the outer loop.)

typedef struct table_derive_entry {
    double a,b,c,d,e;
} derive_t;

void foo(void)
{
    // I wasn't clear on whether table is static/global, or per-call scratch space.
    derive_t *table = aligned_alloc(64, foo*bar*sizeof(derive_t));           // or just malloc, or C99 variable size array.

    // table += offset/sizeof(table[0]);   // if table is global and offset is fixed within one call...

// maybe make offset a macro arg, too?
#define Table_derive(nbol, pos)     table[offset/sizeof(derive_t) + (pos) + _interval_derive_dIdQ / sizeof(derive_t) * (nbol)]


    // ...        
    for(k=0; k< 100; k++ ) {
         ddI += Table_derive(n_bol, k).b;
         ddQ += Table_derive(n_bol, k).c;
    }
    // ...
}
#undef Table_derive

If _interval_derive_dIdQ and offset aren't always multiples of 5 * 8B, then you may need to declare double *double_table = ...; and modify your Table_derive to something like

#define Table_derive(nbol, pos)   ( ((derive_t *)(double_table + offset/sizeof(double) + _interval_derive_dIdQ / sizeof(double) * (nbol)))[pos] )

FP division:

ddI /= _interval_derive_dIdQ;
ddQ /= _interval_derive_dIdQ;

Can you hoist double inv_interval_derive_dIdQ = 1.0 / _interval_derive_dIdQ; out of the loop? Multiply is significantly cheaper than divide, esp. if latency matters or the div unit is also needed for sqrt.
