Profiling my program with linux perf and different call graph modes gives different results

Question

I want to profile my C++ program with Linux perf. To do so I ran the following three commands, and I do not understand why I get three completely different reports.

perf record --call-graph dwarf ./myProg
perf report

perf record --call-graph fp ./myProg
perf report

perf record --call-graph lbr ./myProg
perf report

Also, I do not understand why the main function is not the highest function in the list.

The logic of my program is as follows: the main function calls getPogDocumentFromFile, which calls fromPoxml, which calls toPred, which calls applySubst, which calls subst. Moreover, toPred, applySubst and subst are recursive functions, and I expect them to be the bottleneck.

Some more comments: my program runs for about 25 minutes, is highly recursive and allocates a lot of memory (~17 GB). I compile with -fno-omit-frame-pointer and use a recent Intel CPU.

Any idea?

EDIT:

Thinking again about my question, I realize that I do not understand the meaning of the Children column.

So far I assumed that the Self column was the percentage of samples with the function in question at the top of the call stack, and that the Children column was the percentage of samples with the function anywhere in the call stack. Obviously this is not the case, otherwise the main function would have its Children column not far from 100%. Maybe the call stack is truncated? Or am I completely misunderstanding how profilers work?

Answer 1

The man page of perf report documents the call chain display with children accumulation:

 --children Accumulate callchain of children to parent entry so that then can show up in the output. The output will have a new "Children" column and will be sorted on the data. It requires callchains are recorded. See the 'overhead calculation' section for more details. Enabled by default, disable with --no-children.

I recommend trying the non-default mode with the --no-children option of perf report (or perf top -g --no-children -p $PID_OF_PROGRAM).

So in the default mode, when there is callchain data in the perf.data file, perf report calculates both the "self" and the "self+children" overhead and sorts on the accumulated data. This means that if some function f1() has 10% of "self" samples and calls a leaf function f2() with 20% of "self" samples, then the self+children value of f1() will be 30%. The accumulated value covers all stacks in which the current function appears: the work done in the function itself plus the work done in all of its direct and indirect children (descendants).
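The accumulation described above can be illustrated with a small sketch. This is not perf's actual implementation; the function name overhead and the hand-written SAMPLES list are hypothetical, and the sketch assumes each sample is a call stack listed leaf first and that a function is counted at most once per sample even when recursion repeats it in the stack.

```python
from collections import Counter

def overhead(samples):
    """Return {function: (children_percent, self_percent)}.

    samples is a list of call stacks, each ordered leaf first.
    Assumption of this sketch: a function is counted once per sample
    for the accumulated column, even if recursion repeats it.
    """
    stacks = [s for s in samples if s]  # ignore samples with no frames
    self_hits = Counter()
    children_hits = Counter()
    for stack in stacks:
        self_hits[stack[0]] += 1        # the leaf frame was executing
        for func in set(stack):         # every function on the stack
            children_hits[func] += 1
    total = len(stacks)
    if total == 0:
        return {}
    return {
        func: (100.0 * children_hits[func] / total,
               100.0 * self_hits[func] / total)
        for func in children_hits
    }

# Hypothetical data reproducing the f1()/f2() example: 10 samples in total.
SAMPLES = (
    [["f1", "main"]] * 1            # 10% of samples inside f1 itself
    + [["f2", "f1", "main"]] * 2    # 20% inside the leaf f2, called from f1
    + [["main"]] * 7                # the rest directly in main
)

report = overhead(SAMPLES)
```

With this data, report["f1"] is (30.0, 10.0), report["f2"] is (20.0, 20.0) and report["main"] is (100.0, 70.0). The accumulated value of main only reaches 100% because every hand-written stack here goes all the way down to main.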

You can select the call stack sampling method with the --call-graph option (dwarf / lbr / fp), and each method has its limitations. Sometimes a method (especially fp) may fail to extract parts of the call stack. The -fno-omit-frame-pointer option may help, but when it is used for your executable and not for some library that calls back into your code, the call stack will only be partially extracted. Very long call chains may also be cut short by some methods: for example, dwarf copies only a fixed-size part of the stack per sample, and lbr is bounded by the number of hardware branch records. Or perf report may fail to handle some cases.

To check for truncated call chain samples, run perf script | less and look somewhere in the middle of the output. In this mode perf prints every recorded sample with all detected function names. Samples whose stacks do not end with main and __libc_start_main are truncated.
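The manual check above could be automated along these lines. This is a rough sketch, not a perf feature: the function names parse_stacks and truncated are hypothetical, SAMPLE_TEXT is hand-written rather than real output, and the parser assumes the usual perf script layout (one header line per sample, then one frame per line in the form "address symbol+offset (dso)", with a blank line between samples).

```python
import re

# One frame line: hex address, symbol (optionally with +0xoffset), (dso).
FRAME = re.compile(r"^\s*[0-9a-f]+\s+(.+?)\s+\(.*\)\s*$")

def parse_stacks(text):
    """Split perf-script-like text into stacks of symbol names, leaf first."""
    stacks = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        frames = []
        for line in lines[1:]:  # lines[0] is assumed to be the sample header
            match = FRAME.match(line)
            if match:
                # Drop the "+0x..." offset so only the function name remains.
                frames.append(re.sub(r"\+0x[0-9a-f]+$", "", match.group(1)))
        stacks.append(frames)
    return stacks

def truncated(stacks, roots=("main", "__libc_start_main")):
    """Return the stacks in which none of the expected root functions appears."""
    return [s for s in stacks if not any(f in roots for f in s)]

# Hand-written, hypothetical sample text: the second stack never reaches main.
SAMPLE_TEXT = """\
myProg 4242 100.000001:     250000 cycles:
              401136 subst+0x16 (/home/user/myProg)
              401200 applySubst+0x40 (/home/user/myProg)
              401300 main+0x20 (/home/user/myProg)
        7f0000021b97 __libc_start_main+0xe7 (/lib/libc.so.6)

myProg 4242 100.000251:     250000 cycles:
              401136 subst+0x16 (/home/user/myProg)
              401200 applySubst+0x40 (/home/user/myProg)
"""

stacks = parse_stacks(SAMPLE_TEXT)
bad = truncated(stacks)
```

To try it on a real recording, the text passed to parse_stacks would be the output of perf script, for instance read from sys.stdin. For a multithreaded program, the roots argument would also need the root of the other threads, such as start_thread.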

"otherwise the main function would have its children column not far from 100%"

Yes, for a single-threaded program with correctly recorded and processed call stacks, main should have something like 99% in the "Children" column. In multithreaded programs, the second and subsequent threads will have a different root node, such as start_thread.

Answer 1
4 2020-01-03 04:21:36
