Understanding Linux perf report output
Though I can intuitively make sense of most of the results, I'm having a hard time fully understanding the output of the perf report command, especially the call graph, so I wrote a stupid test to settle this once and for all.
I compiled what follows with:
gcc -Wall -pedantic -lm perf-test.c -o perf-test
No aggressive optimizations, to avoid inlining and such.
#include <math.h>

#define N 10000000UL

#define USELESSNESS(n)                          \
    do {                                        \
        unsigned long i;                        \
        double x = 42;                          \
        for (i = 0; i < (n); i++) x = sin(x);   \
    } while (0)

void baz()
{
    USELESSNESS(N);
}

void bar()
{
    USELESSNESS(2 * N);
    baz();
}

void foo()
{
    USELESSNESS(3 * N);
    bar();
    baz();
}

int main()
{
    foo();
    return 0;
}
perf record ./perf-test
perf report
With these I get:
94,44% perf-test libm-2.19.so [.] __sin_sse2
2,09% perf-test perf-test [.] sin@plt
1,24% perf-test perf-test [.] foo
0,85% perf-test perf-test [.] baz
0,83% perf-test perf-test [.] bar
Which sounds reasonable, since the heavy work is actually performed by __sin_sse2 and sin@plt is probably just a wrapper, while the overhead of my functions accounts for just the loop; overall: 3*N iterations for foo, 2*N for the other two.
perf record -g ./perf-test
perf report -G
perf report
Now I get two overhead columns: Children (the output is sorted by this one by default) and Self (the same overhead as in the flat profile).
Here is where I start to feel I'm missing something: whether or not I use -G, I'm unable to explain the hierarchy in terms of "x calls y" or "y is called by x". For example:
without -G ("y is called by x"):
-   94,34%    94,06%  perf-test  libm-2.19.so  [.] __sin_sse2
   - __sin_sse2
      + 43,67% foo
      + 41,45% main
      + 14,88% bar
-   37,73%     0,00%  perf-test  perf-test     [.] main
     main
     __libc_start_main
-   23,41%     1,35%  perf-test  perf-test     [.] foo
     foo
     main
     __libc_start_main
-    6,43%     0,83%  perf-test  perf-test     [.] bar
     bar
     foo
     main
     __libc_start_main
-    0,98%     0,98%  perf-test  perf-test     [.] baz
   - baz
      + 54,71% foo
      + 45,29% bar
Why is __sin_sse2 called by main (indirectly?), foo and bar but not by baz?
Why do functions sometimes have a percentage and a hierarchy attached (e.g., the last instance of baz) and sometimes not (e.g., the last instance of bar)?
with -G ("x calls y"):
-   94,34%    94,06%  perf-test  libm-2.19.so  [.] __sin_sse2
   + __sin_sse2
   + __libc_start_main
   + main
-   37,73%     0,00%  perf-test  perf-test     [.] main
   - main
      + 62,05% foo
      + 35,73% __sin_sse2
        2,23% sin@plt
-   23,41%     1,35%  perf-test  perf-test     [.] foo
   - foo
      + 64,40% __sin_sse2
      + 29,18% bar
      + 3,98% sin@plt
        2,44% baz
     __libc_start_main
     main
     foo
How should I interpret the first three entries under __sin_sse2?
main calls foo and that's ok, but if it calls __sin_sse2 and sin@plt (indirectly?), why does it not also call bar and baz?
Why do __libc_start_main and main appear under foo? And why does foo appear twice?
My suspicion is that there are two levels in this hierarchy, of which the second actually represents the "x calls y"/"y is called by x" semantics, but I'm tired of guessing, so I'm asking here. And the documentation doesn't seem to help.
Sorry for the long post, but I hope that all this context may help or act as a reference for someone else too.
Alright, let's temporarily ignore the difference between caller and callee call graphs, mostly because when I compare the results of these two options on my machine, I only see effects inside the kernel.kallsyms DSO, for reasons I don't understand; I'm relatively new to this myself.
I found that for your example, it's a little easier to read the whole tree. So, using --stdio, let's look at the whole tree for __sin_sse2:
# Overhead Command Shared Object Symbol
# ........ ......... ................. ......................
#
94.72% perf-test libm-2.19.so [.] __sin_sse2
|
--- __sin_sse2
|
|--44.20%-- foo
| |
| --100.00%-- main
| __libc_start_main
| _start
| 0x0
|
|--27.95%-- baz
| |
| |--51.78%-- bar
| | foo
| | main
| | __libc_start_main
| | _start
| | 0x0
| |
| --48.22%-- foo
| main
| __libc_start_main
| _start
| 0x0
|
--27.84%-- bar
|
--100.00%-- foo
main
__libc_start_main
_start
0x0
So, the way I read this is: about 44% of the time, sin is called from foo; about 28% of the time it's called from baz, and about 28% from bar.
The documentation for -g is instructive:
-g [type,min[,limit],order[,key]], --call-graph
Display call chains using type, min percent threshold, optional print limit and order. type can be either:
· flat: single column, linear exposure of call chains.
· graph: use a graph tree, displaying absolute overhead rates.
· fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.
order can be either:
- callee: callee based call graph.
- caller: inverted caller based call graph.
key can be:
- function: compare on functions
- address: compare on individual code addresses
Default: fractal,0.5,callee,function.
The important piece here is that the default is fractal, and in fractal mode each branch is a new object, so the percentages below a node are relative to that node rather than to the total.
So, you can see that roughly 50% of the time baz is called, it's called from bar, and the other 50% of the time it's called from foo.
This isn't always the most useful measure, so it's instructive to look at the results using -g graph:
94.72% perf-test libm-2.19.so [.] __sin_sse2
|
--- __sin_sse2
|
|--41.87%-- foo
| |
| --41.48%-- main
| __libc_start_main
| _start
| 0x0
|
|--26.48%-- baz
| |
| |--13.50%-- bar
| | foo
| | main
| | __libc_start_main
| | _start
| | 0x0
| |
| --12.57%-- foo
| main
| __libc_start_main
| _start
| 0x0
|
--26.38%-- bar
|
--26.17%-- foo
main
__libc_start_main
_start
0x0
This switches to absolute percentages, where each percentage is the share of total time spent in that call chain. So foo->bar is 26% of the total ticks (bar in turn calls baz), and foo->baz (direct) is 12% of the total ticks.
I still have no idea why I don't see any differences between the callee and caller graphs from the perspective of __sin_sse2, though.
One thing I did change from your command line is how the call graphs were gathered. Linux perf by default uses the frame pointer method of reconstructing call stacks. This can be a problem when the compiler uses -fomit-frame-pointer as a default. So I used
perf record --call-graph dwarf ./perf-test