
What might cause the same SSE code to run a few times slower in the same function?

Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.


I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:

(image: VTune profile)

In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of the instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both executed n times. This happens every time I run the profiler, on both a Westmere Xeon and the Haswell laptop I'm using right now (compiled with SSE because that's what I'm targeting and learning right now).

What am I missing?

Ignore the poor concurrency; this is most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.

I believe this is not an example of micro-optimisation, since those three added together amount to a decent percentage of the total time, and I'm really interested in the possible cause of this behavior.

Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...

(image: VTune profile)

Same profile, albeit with a slightly lower CPI this time.

Well, when analyzing assembly code, please note that running time is attributed to the next instruction, so the per-instruction data needs to be interpreted carefully. There is a corresponding note in the VTune Release Notes:

Running time is attributed to the next instruction (200108041)

To collect the data about time-consuming running regions of the target, the Intel® VTune™ Amplifier interrupts executing target threads and attributes the time to the context IP address.

Due to the collection mechanism, the captured IP address points to an instruction AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely, to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source: the time may be erroneously attributed to the source line AFTER the actual hot line.

In case the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to a wrong function, since the next instruction can belong to a different function in tightly inlined code.

HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.

The inputs for your first group come from your gather. This probably cache-misses a lot, and those costs aren't going to get charged to the SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted.)

The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.

But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation, since the loads have to wait for RAX to be ready.


It looks like there's a lot you could optimize here.

  • Instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).

  • Calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.

  • Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (That's a guaranteed store-forwarding failure; see Agner Fog's optimization guides and other links from the x86 tag wiki.)

    It might be worth it to partially vectorize the address math (the calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget).

    Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that the pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).

    Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).

  • Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).

  • Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!

  • Use a vector compare instead of scalar prev_x / prev_y / prev_z. Use MOVMASKPS to get the result into an integer you can test. (You only care about the lower 3 elements, so test it with compare_mask & 0b0111, which is true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html.)
