
Why does the position of a function in a c++ file affect its performance

Why does the position of a function in a C++ file affect its performance? Specifically, in the example given below we have two identical functions that have different, consistent performance profiles. How does one go about investigating this and determining why the performance is so different?

The example is pretty straightforward in that we have two functions: a and b. Each is run many times in a tight loop, optimised ( -O3 -march=corei7-avx ), and timed. Here is the code:

#include <cstdint>
#include <iostream>
#include <numeric>

#include <boost/timer/timer.hpp>

bool array[] = {true, false, true, false, false, true};

uint32_t __attribute__((noinline)) a() {
    asm("");
    return std::accumulate(std::begin(array), std::end(array), 0);
}

uint32_t __attribute__((noinline)) b() {
    asm("");
    return std::accumulate(std::begin(array), std::end(array), 0);
}

const size_t WARM_ITERS = 1ull << 10;
const size_t MAX_ITERS = 1ull << 30;

void test(const char* name, uint32_t (*fn)())
{
    std::cout << name << ": ";
    for (size_t i = 0; i < WARM_ITERS; i++) {
        fn();
        asm("");
    }
    boost::timer::auto_cpu_timer t;
    for (size_t i = 0; i < MAX_ITERS; i++) {
        fn();
        asm("");
    }
}

int main(int argc, char **argv)
{
    test("a", a);
    test("b", b);
    return 0;
}

Some notable features:

  • Function a and b are identical. They perform the same accumulate operation and compile down to the same assembly instructions.
  • Each test iteration has a warm-up period before the timing starts, to try and eliminate any issues with warming up caches.

When this is compiled and run we get the following output showing a is significantly slower than b:

[me@host:~/code/mystery] make && ./mystery 
g++-4.8 -c -g -O3 -Wall -Wno-unused-local-typedefs -std=c++11 -march=corei7-avx -I/usr/local/include/boost-1_54/ mystery.cpp -o mystery.o
g++-4.8  mystery.o -lboost_system-gcc48-1_54 -lboost_timer-gcc48-1_54 -o mystery
a:  7.412747s wall, 7.400000s user + 0.000000s system = 7.400000s CPU (99.8%)
b:  5.729706s wall, 5.740000s user + 0.000000s system = 5.740000s CPU (100.2%)

If we invert the two tests (ie call test(b) and then test(a) ) a is still slower than b:

[me@host:~/code/mystery] make && ./mystery 
g++-4.8 -c -g -O3 -Wall -Wno-unused-local-typedefs -std=c++11 -march=corei7-avx -I/usr/local/include/boost-1_54/ mystery.cpp -o mystery.o
g++-4.8  mystery.o -lboost_system-gcc48-1_54 -lboost_timer-gcc48-1_54 -o mystery
b:  5.733968s wall, 5.730000s user + 0.000000s system = 5.730000s CPU (99.9%)
a:  7.414538s wall, 7.410000s user + 0.000000s system = 7.410000s CPU (99.9%)

If we now invert the location of the functions in the C++ file (move the definition of b above a) the results are inverted and a becomes faster than b!

[me@host:~/code/mystery] make && ./mystery 
g++-4.8 -c -g -O3 -Wall -Wno-unused-local-typedefs -std=c++11 -march=corei7-avx -I/usr/local/include/boost-1_54/ mystery.cpp -o mystery.o
g++-4.8  mystery.o -lboost_system-gcc48-1_54 -lboost_timer-gcc48-1_54 -o mystery
a:  5.729604s wall, 5.720000s user + 0.000000s system = 5.720000s CPU (99.8%)
b:  7.411549s wall, 7.420000s user + 0.000000s system = 7.420000s CPU (100.1%)

So essentially whichever function is at the top of the c++ file is slower.

Some answers to questions you may have:

  • The code compiled is identical for both a and b. The disassembly has been checked. (For those interested: http://pastebin.com/2QziqRXR )
  • The code was compiled using gcc 4.8 and gcc 4.8.1 on ubuntu 13.04, ubuntu 13.10, and ubuntu 12.04.03.
  • Effects observed on Intel Sandy Bridge i7-2600 and Intel Xeon X5482 CPUs.

Why would this be happening? What tools are available to investigate something like this?

It looks to me like it's a cache aliasing issue.

The test case is quite clever, and correctly loads everything into cache before timing it. It looks like everything fits in cache: though simulated, I've verified this by looking at the output of valgrind's cachegrind tool, and as one would expect in such a small test case, there are no significant cache misses:

valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64  /tmp/so
==11130== Cachegrind, a cache and branch-prediction profiler
==11130== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==11130== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11130== Command: /tmp/so
==11130== 
--11130-- warning: L3 cache found, using its data for the LL simulation.
a:  6.692648s wall, 6.670000s user + 0.000000s system = 6.670000s CPU (99.7%)
b:  7.306552s wall, 7.280000s user + 0.000000s system = 7.280000s CPU (99.6%)
==11130== 
==11130== I   refs:      2,484,996,374
==11130== I1  misses:            1,843
==11130== LLi misses:            1,694
==11130== I1  miss rate:          0.00%
==11130== LLi miss rate:          0.00%
==11130== 
==11130== D   refs:        537,530,151  (470,253,428 rd   + 67,276,723 wr)
==11130== D1  misses:           14,477  (     12,433 rd   +      2,044 wr)
==11130== LLd misses:            8,336  (      6,817 rd   +      1,519 wr)
==11130== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==11130== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==11130== 
==11130== LL refs:              16,320  (     14,276 rd   +      2,044 wr)
==11130== LL misses:            10,030  (      8,511 rd   +      1,519 wr)
==11130== LL miss rate:            0.0% (        0.0%     +        0.0%  )

I picked a 32k, 8-way associative cache with a 64-byte cache line size to match common Intel CPUs, and saw the same discrepancy between the a and b functions repeatedly.

Running on an imaginary machine with a 32k, 128-way associative cache with the same cache line size though, that difference all but goes away:

valgrind --tool=cachegrind --I1=32768,128,64 --D1=32768,128,64  /tmp/so
==11135== Cachegrind, a cache and branch-prediction profiler
==11135== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==11135== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11135== Command: /tmp/so
==11135== 
--11135-- warning: L3 cache found, using its data for the LL simulation.
a:  6.754838s wall, 6.730000s user + 0.010000s system = 6.740000s CPU (99.8%)
b:  6.827246s wall, 6.800000s user + 0.000000s system = 6.800000s CPU (99.6%)
==11135== 
==11135== I   refs:      2,484,996,642
==11135== I1  misses:            1,816
==11135== LLi misses:            1,718
==11135== I1  miss rate:          0.00%
==11135== LLi miss rate:          0.00%
==11135== 
==11135== D   refs:        537,530,207  (470,253,470 rd   + 67,276,737 wr)
==11135== D1  misses:           14,297  (     12,276 rd   +      2,021 wr)
==11135== LLd misses:            8,336  (      6,817 rd   +      1,519 wr)
==11135== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==11135== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==11135== 
==11135== LL refs:              16,113  (     14,092 rd   +      2,021 wr)
==11135== LL misses:            10,054  (      8,535 rd   +      1,519 wr)
==11135== LL miss rate:            0.0% (        0.0%     +        0.0%  )

Since in an 8-way cache there are fewer spaces where potentially aliasing functions can hide, you get the addressing equivalent of more hash collisions. On the machine with different cache associativity, in this instance you luck out with where things get placed in the object file, and so though it's not a cache miss, you also don't have to do any work to resolve which cache line you actually need.

Edit: more on cache associativity: http://en.wikipedia.org/wiki/CPU_cache#Associativity


Another edit: I've confirmed this with hardware event monitoring through the perf tool.

I modified the source to call only a() or b() depending on whether there was a command line argument present. The timings are the same as in the original test case.

sudo perf record -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses /tmp/so
a:  6.317755s wall, 6.300000s user + 0.000000s system = 6.300000s CPU (99.7%)
sudo perf report 

4K dTLB-loads
97 dTLB-load-misses
4K dTLB-stores
7 dTLB-store-misses
479 iTLB-loads
142 iTLB-load-misses               

whereas

sudo perf record -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses /tmp/so foobar
b:  4.854249s wall, 4.840000s user + 0.000000s system = 4.840000s CPU (99.7%)
sudo perf report 

3K dTLB-loads
87 dTLB-load-misses
3K dTLB-stores
19 dTLB-store-misses
259 iTLB-loads
93 iTLB-load-misses

Showing that b has less TLB activity, and so the cache doesn't have to be evicted. Given that the functionality between the two is otherwise identical, it can only be explained by aliasing.

You are calling a and b from test. Since the compiler has no reason to reorder your two functions, a is further away than b (in the original) from test. You are also using templates, so the actual generated code is quite a bit bigger than what it looks like in the C++ source.

It is therefore quite possible that the instruction memory for b gets into the instruction cache together with test, while a, being further away, does not get into the cache and therefore takes longer than b to fetch from lower-level caches or CPU main memory.

It is therefore possible that, because of longer instruction fetch cycles for a than for b, a runs slower than b even though the actual code is the same; it is just further away.

Certain CPU architectures (such as the arm cortex-A series) support performance counters that count the number of cache misses. Tools like perf can capture this data when set to work with the appropriate performance counters.

