
linux perf: how to interpret and find hotspots

I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to valgrind's callgrind, which is of course a totally different approach from perf's sampling-based method.

What I did:

perf record -g -p $(pidof someapp)
perf report -g -n

Now I see something like this:

+     16.92%  kdevelop  libsqlite3.so.0.8.6               [.] 0x3fe57
+     10.61%  kdevelop  libQtGui.so.4.7.3                 [.] 0x81e344
+      7.09%  kdevelop  libc-2.14.so                      [.] 0x85804
+      4.96%  kdevelop  libQtGui.so.4.7.3                 [.] 0x265b69
+      3.50%  kdevelop  libQtCore.so.4.7.3                [.] 0x18608d
+      2.68%  kdevelop  libc-2.14.so                      [.] memcpy
+      1.15%  kdevelop  [kernel.kallsyms]                 [k] copy_user_generic_string
+      0.90%  kdevelop  libQtGui.so.4.7.3                 [.] QTransform::translate(double, double)
+      0.88%  kdevelop  libc-2.14.so                      [.] __libc_malloc
+      0.85%  kdevelop  libc-2.14.so                      [.] memcpy
...

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries, I see no way to optimize my code.

Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.

Is this possible with perf? If so, how?

Note: I found out that "E" unwraps the callgraph and gives somewhat more information. But the callgraph is often not deep enough and/or terminates randomly without giving information about how much time was spent where. Example:

-     10.26%  kate  libkatepartinterfaces.so.4.6.0  [.] Kate::TextLoader::readLine(int&...
     Kate::TextLoader::readLine(int&, int&)                                            
     Kate::TextBuffer::load(QString const&, bool&, bool&)                              
     KateBuffer::openFile(QString const&)                                              
     KateDocument::openFile()                                                          
     0x7fe37a81121c

Could it be an issue that I'm running on 64 bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using Fedora, but this seems to apply to all 64-bit systems).

With Linux 3.7, perf is finally able to use DWARF information to generate the callgraph:

perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children
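
If you need to profile an already-running process, as in the original question, the same call-graph option can be combined with -p. A sketch, where sleep 30 is just an arbitrary way to bound the recording time:

perf record --call-graph dwarf -p $(pidof someapp) -- sleep 30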

Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar... I recommend trying out FlameGraphs instead, which is a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
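
For reference, a minimal sketch of that flame graph workflow, assuming you have recorded with call graphs as above and cloned the FlameGraph scripts from https://github.com/brendangregg/FlameGraph (the file names here are arbitrary):

perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg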

Note: In the report step, -g graph makes the results output simple-to-understand "relative to total" percentages, rather than "relative to parent" numbers. --no-children will show only self cost, rather than inclusive cost - a feature that I also find invaluable.

If you have a new perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:

perf record --call-graph lbr -- yourapp

The downside here is that the call stack depth is more limited compared to the default DWARF unwinder configuration.
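
Conversely, if it is the default DWARF unwinder that cuts your stacks short, you can ask it to copy a larger chunk of the user stack per sample. A sketch, where the size in bytes is just an example and larger values make perf.data grow quickly:

perf record --call-graph dwarf,65528 -- yourapp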

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries, I see no way to optimize my code.

Are you sure that your application someapp is built with the gcc option -fno-omit-frame-pointer (and possibly its dependent libraries)? Something like this:

g++ -m64 -fno-omit-frame-pointer -g main.cpp
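
As a small end-to-end sketch of the frame-pointer route (assuming a standalone main.cpp; the output name someapp is just an example), record with the fp unwinder, which is also what plain -g defaults to:

g++ -m64 -fno-omit-frame-pointer -g main.cpp -o someapp
perf record --call-graph fp -- ./someapp
perf report -g -n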

You should give hotspot a try: https://www.kdab.com/hotspot-gui-linux-perf-profiler/

It's available on GitHub: https://github.com/KDAB/hotspot

It is, for example, able to generate flame graphs for you.

(flame graph screenshot)
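
As a usage sketch (the exact invocation may differ between versions, so check the project's README), hotspot is typically pointed at a perf.data file that was recorded with call graphs:

perf record --call-graph dwarf -- yourapp
hotspot ./perf.data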

You can get a very detailed, source-level report with perf annotate; see "Source level analysis with perf annotate". It will look something like this (shamelessly stolen from the website):

------------------------------------------------
 Percent |   Source code & Disassembly of noploop
------------------------------------------------
         :
         :
         :
         :   Disassembly of section .text:
         :
         :   08048484 <main>:
         :   #include <string.h>
         :   #include <unistd.h>
         :   #include <sys/time.h>
         :
         :   int main(int argc, char **argv)
         :   {
    0.00 :    8048484:       55                      push   %ebp
    0.00 :    8048485:       89 e5                   mov    %esp,%ebp
[...]
    0.00 :    8048530:       eb 0b                   jmp    804853d <main+0xb9>
         :                           count++;
   14.22 :    8048532:       8b 44 24 2c             mov    0x2c(%esp),%eax
    0.00 :    8048536:       83 c0 01                add    $0x1,%eax
   14.78 :    8048539:       89 44 24 2c             mov    %eax,0x2c(%esp)
         :           memcpy(&tv_end, &tv_now, sizeof(tv_now));
         :           tv_end.tv_sec += strtol(argv[1], NULL, 10);
         :           while (tv_now.tv_sec < tv_end.tv_sec ||
         :                  tv_now.tv_usec < tv_end.tv_usec) {
         :                   count = 0;
         :                   while (count < 100000000UL)
   14.78 :    804853d:       8b 44 24 2c             mov    0x2c(%esp),%eax
   56.23 :    8048541:       3d ff e0 f5 05          cmp    $0x5f5e0ff,%eax
    0.00 :    8048546:       76 ea                   jbe    8048532 <main+0xae>
[...]

Don't forget to pass the -fno-omit-frame-pointer and the -ggdb flags when you compile your code.
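
A sketch of how such a report is produced: record the example program (here noploop, which takes its runtime in seconds as an argument, as the output above suggests), then annotate a symbol of interest. The symbol name main and the --stdio flag (which skips the interactive browser) are just one way to do it:

perf record ./noploop 5
perf annotate --stdio main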

Unless your program has very few functions and hardly ever calls a system function or does I/O, profilers that sample the program counter won't tell you much, as you're discovering. In fact, the well-known profiler gprof was created specifically to try to address the uselessness of self-time-only profiling (not that it succeeded).

What actually works is something that samples the call stack (thereby finding out where the calls are coming from), on wall-clock time (thereby including I/O time), and reports by line or by instruction (thereby pinpointing the function calls that you should investigate, not just the functions they live in).

Furthermore, the statistic you should look for is percent of time on stack, not number of calls, and not average inclusive function time. Especially not "self time". If a call instruction (or a non-call instruction) is on the stack 38% of the time, then how much would you save if you could get rid of it? 38%! Pretty simple, no?

An example of such a profiler is Zoom.

There are more issues to be understood on this subject.

Added: @caf got me hunting for the perf info, and since you included the command-line argument -g, it does collect stack samples. Then you can get a call-tree report. Then if you make sure you're sampling on wall-clock time (so you get wait time as well as CPU time), you've got almost what you need.
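
As a sketch of getting that call-tree view out of the recorded stack samples (the exact -g syntax varies somewhat between perf versions), a caller-ordered graph with a 0.5% threshold would look something like:

perf report -g graph,0.5,caller -n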
