Profiling my program with linux perf and different call graph modes gives different results

Question

I want to profile my C++ program with Linux perf. I used the following three commands, and I do not understand why I get three completely different reports.

perf record --call-graph dwarf ./myProg
perf report

perf record --call-graph fp ./myProg
perf report

perf record --call-graph lbr ./myProg
perf report

Also I do not understand why the main function is not the highest function in the list.

The logic of my program is the following: the main function calls getPogDocumentFromFile, which calls fromPoxml, which calls toPred, which calls applySubst, which calls subst. Moreover, toPred, applySubst and subst are recursive functions, and I expect them to be the bottleneck.

Some more comments: my program runs for about 25 minutes, it is highly recursive and it allocates a lot of memory (~17 GB). I compile with -fno-omit-frame-pointer and use a recent Intel CPU.

Any idea?

EDIT:

Thinking again about my question, I realize that I do not understand the meaning of the Children column.

So far I assumed that the Self column was the percentage of samples in which the function is at the top of the call stack, and the Children column was the percentage of samples in which the function appears anywhere in the call stack. Obviously this is not the case, otherwise the main function would have its children column not far from 100%. Maybe the call stack is truncated? Or am I completely misunderstanding how profilers work?

Answer 1

The man page of perf report documents the call chain display with children accumulation:

 --children Accumulate callchain of children to parent entry so that then can show up in the output. The output will have a new "Children" column and will be sorted on the data. It requires callchains are recorded. See the 'overhead calculation' section for more details. Enabled by default, disable with --no-children.

I recommend trying the non-default mode with the --no-children option of perf report (or perf top -g --no-children -p $PID_OF_PROGRAM).

So in default mode, when the perf.data file contains callchain data, perf report calculates "self" and "self+children" overhead and sorts on the accumulated value. This means that if some function f1() has 10% of "self" samples and calls a leaf function f2() with 20% of "self" samples, then the self+children value of f1() is 30%. The accumulated value covers every stack in which the function appears: the work done in the function itself plus the work done in all of its direct and indirect children (descendants).
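The accumulation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not perf's actual implementation: the function name overhead and the hand-written sample stacks are made up for the example, and it assumes that a function appearing several times in one stack (recursion) is counted once for that sample.

```python
from collections import Counter


def overhead(samples):
    """Return {function: (children_pct, self_pct)} for a list of call stacks.

    Each stack is a list of function names, innermost (leaf) frame first.
    Illustrative model only; not taken from the perf source code.
    """
    stacks = [stack for stack in samples if stack]
    total = len(stacks)
    self_hits = Counter()
    children_hits = Counter()
    for stack in stacks:
        # "Self": the function that was executing when the sample was taken.
        self_hits[stack[0]] += 1
        # "Children" (self + children): every function present in the stack.
        # set() makes a recursive function count once per sample (assumption).
        for func in set(stack):
            children_hits[func] += 1
    return {
        func: (100.0 * children_hits[func] / total,
               100.0 * self_hits[func] / total)
        for func in children_hits
    }


# Hypothetical profile with 10 samples: 1 in f1 itself, 2 in f2 (called
# from f1), and 7 in main itself.
samples = [["f1", "main"]] + [["f2", "f1", "main"]] * 2 + [["main"]] * 7
report = overhead(samples)
# report["f1"] is (30.0, 10.0): 10% self plus 20% from its child f2.
# report["main"] is (100.0, 70.0): main is present in every stack.
```

In this model main only reaches 100% because every stack really contains main. If the unwinder loses the outer frames of some samples, those samples no longer contribute to main's Children value.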

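You can choose the call stack sampling method with the --call-graph option (dwarf / lbr / fp), and each method has its limitations. Sometimes a method (especially fp) fails to extract parts of the call stack. The -fno-omit-frame-pointer option may help, but when it is used for your executable and not for some library that calls back into your code, the call stack will only be extracted partially. Very long call chains may not be fully extracted by some methods either: for example, dwarf mode saves only a fixed-size copy of the stack with each sample (8 kB by default, adjustable with --call-graph dwarf,SIZE), and lbr is limited by the number of hardware branch-record entries. It is also possible that perf report fails to handle some cases.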

To check for truncated call chains, run perf script | less and look at samples somewhere in the middle of the output. In this mode perf prints every recorded sample with all the function names it detected. Check for samples whose stack does not end with main and __libc_start_main: they are truncated.
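Reading thousands of samples in less gets tedious, so the same check can be scripted. The following is a rough sketch that assumes the default text layout of perf script with call chains: one unindented header line per sample, then one indented "address symbol+offset (dso)" line per frame, then a blank line. The helper names parse_stacks and truncated_fraction, the sample text and the paths in it are made up for the example. It treats a stack as complete when main appears anywhere in it, which only makes sense for a single-threaded program.

```python
import re

# Matches the "+0x1f" offset that follows the symbol name in a frame line.
_OFFSET = re.compile(r"\+0x[0-9a-fA-F]+$")


def parse_stacks(text):
    """Split perf-script-like text into stacks (lists of symbols, leaf first)."""
    stacks, current = [], None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sample.
            if current is not None:
                stacks.append(current)
                current = None
        elif line[0].isspace():
            # Indented frame line: "<address> <symbol+offset> (<dso>)".
            if current is None:
                continue  # frame without a header line; ignore it
            _addr, _, rest = line.strip().partition(" ")
            symbol = rest.rpartition(" (")[0] or rest
            current.append(_OFFSET.sub("", symbol.strip()))
        else:
            # Unindented header line starts a new sample.
            if current is not None:
                stacks.append(current)
            current = []
    if current is not None:
        stacks.append(current)
    return stacks


def truncated_fraction(stacks, root="main"):
    """Fraction of stacks in which the expected root function is missing."""
    if not stacks:
        return 0.0
    missing = sum(1 for stack in stacks if root not in stack)
    return missing / len(stacks)


# Hand-written example input (addresses and paths are invented).
SAMPLE_OUTPUT = """\
myProg 4242 100.000001:     250000 cycles:
        55d0c3a1b2c4 subst+0x14 (/home/user/myProg)
        55d0c3a1b3d0 applySubst+0x40 (/home/user/myProg)
        55d0c3a1b4e8 main+0x22 (/home/user/myProg)
        7f1a2b3c4d5e __libc_start_main+0xf3 (/usr/lib/libc.so.6)

myProg 4242 100.000251:     250000 cycles:
        55d0c3a1b2c4 subst+0x14 (/home/user/myProg)
        55d0c3a1b2f0 subst+0x40 (/home/user/myProg)
"""

stacks = parse_stacks(SAMPLE_OUTPUT)
fraction = truncated_fraction(stacks)  # 0.5: the second stack never reaches main
```

On real data you would save the output of perf script to a file and pass its contents to parse_stacks. A large fraction of stacks without main suggests that the chosen call graph mode is truncating your call chains.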

You wrote: "otherwise the main function would have its children column not far from 100%"

Yes, for a single-threaded program with correctly recorded and processed call stacks, main should have something like 99% in the "Children" column. In multithreaded programs, the second and subsequent threads have a different root node, such as start_thread.
