linux perf: how to interpret and find hotspots

I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to valgrind's callgrind, which is of course a totally different approach from perf's sampling-based method.

What I did:

perf record -g -p $(pidof someapp)
perf report -g -n

Now I see something like this:

+     16.92%  kdevelop  libsqlite3.so.0.8.6               [.] 0x3fe57
+     10.61%  kdevelop  libQtGui.so.4.7.3                 [.] 0x81e344
+      7.09%  kdevelop  libc-2.14.so                      [.] 0x85804
+      4.96%  kdevelop  libQtGui.so.4.7.3                 [.] 0x265b69
+      3.50%  kdevelop  libQtCore.so.4.7.3                [.] 0x18608d
+      2.68%  kdevelop  libc-2.14.so                      [.] memcpy
+      1.15%  kdevelop  [kernel.kallsyms]                 [k] copy_user_generic_string
+      0.90%  kdevelop  libQtGui.so.4.7.3                 [.] QTransform::translate(double, double)
+      0.88%  kdevelop  libc-2.14.so                      [.] __libc_malloc
+      0.85%  kdevelop  libc-2.14.so                      [.] memcpy

OK, these functions might be slow, but how do I find out where they are being called from? As all these hotspots lie in external libraries, I see no way to optimize my code.

Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.

Is this possible with perf? If so, how?

Note: I found out that "E" unwraps the callgraph and gives somewhat more information. But the callgraph is often not deep enough and/or terminates randomly without giving information about how much time was spent where. Example:

-     10.26%  kate  libkatepartinterfaces.so.4.6.0  [.] Kate::TextLoader::readLine(int&...
     Kate::TextLoader::readLine(int&, int&)                                            
     Kate::TextBuffer::load(QString const&, bool&, bool&)                              
     KateBuffer::openFile(QString const&)                                              

Could it be an issue that I'm running on 64 bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using Fedora, but it seems to apply to all 64-bit systems).

With Linux 3.7, perf is finally able to use DWARF information to generate the callgraph:

perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children

Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar. I recommend trying out FlameGraphs instead, which is a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
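To make the flame graph input less mysterious, here is a minimal Python sketch of the "folded stacks" text format that the FlameGraph scripts consume: one line per distinct call stack, frames joined by semicolons (outermost first), followed by a sample count. The sample data and the function names `load_file` and `render` are made up for illustration; in practice the stacks come from perf's recorded data rather than a hand-written list.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse call-stack samples into folded lines: 'outer;inner count'."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples: each entry is one sampled call stack, outermost frame first.
samples = (
    [["main", "load_file", "memcpy"]] * 5
    + [["main", "load_file"]] * 2
    + [["main", "render", "memcpy"]] * 3
)

folded = fold_stacks(samples)
for line in folded:
    print(line)
# main;load_file 2
# main;load_file;memcpy 5
# main;render;memcpy 3
```

Each folded line becomes one tower in the flame graph, and the width of a box is proportional to the number of samples that contain it.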

Note: In the report step, -g graph makes the output show easy-to-understand "relative to total" percentages rather than "relative to parent" numbers. --no-children shows only self cost rather than inclusive cost, a feature that I also find invaluable.

If you have a recent perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:
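The difference between self cost and inclusive cost is easiest to see on a handful of stack samples. The following Python sketch is only an illustration of the two metrics, not of how perf computes them internally; the sample data and the function names `load_file` and `render` are invented.

```python
from collections import Counter


def self_and_inclusive(samples):
    """Return {function: (self %, inclusive %)} relative to the total sample count."""
    self_hits = Counter()
    inclusive_hits = Counter()
    for stack in samples:
        self_hits[stack[-1]] += 1          # innermost frame: the code actually running
        inclusive_hits.update(set(stack))  # every frame on the stack, once per sample
    total = len(samples)
    return {
        name: (100.0 * self_hits[name] / total, 100.0 * inclusive_hits[name] / total)
        for name in inclusive_hits
    }


# Hypothetical samples: each entry is one sampled call stack, outermost frame first.
samples = (
    [["main", "load_file", "memcpy"]] * 5
    + [["main", "load_file"]] * 2
    + [["main", "render", "memcpy"]] * 3
)

report = self_and_inclusive(samples)
for name, (self_pct, incl_pct) in sorted(report.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:10s} self {self_pct:5.1f}%  inclusive {incl_pct:5.1f}%")
```

In this toy data `memcpy` has the highest self cost (80%), which is what a self-only view shows, while `load_file` has only 20% self cost but 70% inclusive cost. The inclusive column is the one that points back to your own code when the hot leaf functions live in libraries.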

perf record --call-graph lbr -- yourapp

The downside here is that the call stack depth is more limited compared to the default DWARF unwinder configuration.

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.

Are you sure that your application someapp (and possibly the libraries it depends on) is built with the gcc option -fno-omit-frame-pointer? Something like this:

g++ -m64 -fno-omit-frame-pointer -g main.cpp
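Why the frame pointer matters can be sketched with a toy model of frame-pointer unwinding on x86-64, where each frame stores the caller's rbp at [rbp] and the return address at [rbp + 8]. This is a simplified simulation in Python with made-up addresses, not perf's actual unwinder; it only shows why a call chain stops early once rbp no longer points at a saved frame.

```python
def unwind(memory, rbp, rip, max_depth=64):
    """Follow the chain of saved frame pointers and collect return addresses.

    memory maps a stack address to the 8-byte value stored there.
    """
    trace = [rip]
    for _ in range(max_depth):
        if rbp not in memory or rbp + 8 not in memory:
            break  # rbp does not point at a saved frame: the trace ends here
        trace.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]              # caller's frame pointer
    return trace


# Hypothetical stack for _start -> main -> load -> read_line (all addresses invented).
memory = {
    0x7000: 0x0,    0x7008: 0x400100,  # main's frame: returns into _start
    0x6F00: 0x7000, 0x6F08: 0x400500,  # load's frame: returns into main
    0x6E00: 0x6F00, 0x6E08: 0x400800,  # read_line's frame: returns into load
}

# Built with frame pointers: rbp points at read_line's frame, full chain recovered.
with_fp = unwind(memory, rbp=0x6E00, rip=0x400C00)

# Built without them: rbp is just another register holding an arbitrary value.
without_fp = unwind(memory, rbp=0x2A, rip=0x400C00)

print([hex(a) for a in with_fp])
print([hex(a) for a in without_fp])
```

In the second case only the sampled instruction address survives, which matches the symptom of call graphs that are too shallow or stop at seemingly random places.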

You should give hotspot a try: https://www.kdab.com/hotspot-gui-linux-perf-profiler/

It's available on GitHub: https://github.com/KDAB/hotspot

It is, for example, able to generate flame graphs for you.


You can get a very detailed, source-level report with perf annotate; see "Source level analysis with perf annotate". It will look something like this (shamelessly stolen from the website):

 Percent |   Source code & Disassembly of noploop
         :   Disassembly of section .text:
         :   08048484 <main>:
         :   #include <string.h>
         :   #include <unistd.h>
         :   #include <sys/time.h>
         :   int main(int argc, char **argv)
         :   {
    0.00 :    8048484:       55                      push   %ebp
    0.00 :    8048485:       89 e5                   mov    %esp,%ebp
    0.00 :    8048530:       eb 0b                   jmp    804853d <main+0xb9>
         :                           count++;
   14.22 :    8048532:       8b 44 24 2c             mov    0x2c(%esp),%eax
    0.00 :    8048536:       83 c0 01                add    $0x1,%eax
   14.78 :    8048539:       89 44 24 2c             mov    %eax,0x2c(%esp)
         :           memcpy(&tv_end, &tv_now, sizeof(tv_now));
         :           tv_end.tv_sec += strtol(argv[1], NULL, 10);
         :           while (tv_now.tv_sec < tv_end.tv_sec ||
         :                  tv_now.tv_usec < tv_end.tv_usec) {
         :                   count = 0;
         :                   while (count < 100000000UL)
   14.78 :    804853d:       8b 44 24 2c             mov    0x2c(%esp),%eax
   56.23 :    8048541:       3d ff e0 f5 05          cmp    $0x5f5e0ff,%eax
    0.00 :    8048546:       76 ea                   jbe    8048532 <main+0xae>

Don't forget to pass the -fno-omit-frame-pointer and -ggdb flags when you compile your code.

Unless your program has very few functions and hardly ever calls a system function or I/O, profilers that sample the program counter won't tell you much, as you're discovering. In fact, the well-known profiler gprof was created specifically to try to address the uselessness of self-time-only profiling (not that it succeeded).

What actually works is something that samples the call stack (thereby finding out where the calls are coming from), on wall-clock time (thereby including I/O time), and reports by line or by instruction (thereby pinpointing the function calls that you should investigate, not just the functions they live in).

Furthermore, the statistic you should look for is percent of time on stack, not number of calls, not average inclusive function time. Especially not "self time". If a call instruction (or a non-call instruction) is on the stack 38% of the time, then if you could get rid of it, how much would you save? 38%! Pretty simple, no?

An example of such a profiler is Zoom.

There are more issues to be understood on this subject.

Added: @caf got me hunting for the perf info, and since you included the command-line argument -g, it does collect stack samples. Then you can get a call-tree report. Then if you make sure you're sampling on wall-clock time (so you get wait time as well as CPU time), you've got almost what you need.
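The idea in the last two paragraphs can be sketched as a tiny in-process sampler. This is an illustration in Python, not how perf or Zoom work: it assumes CPython, whose `sys._current_frames()` exposes the current frame of each thread, and the workload functions `wait_for_io`, `compute` and `workload` are hypothetical stand-ins. A background thread records the main thread's stack at fixed wall-clock intervals, and the report gives the percent of samples in which each (function, line) pair is on the stack.

```python
import collections
import sys
import threading
import time
import traceback


def sample_stacks(thread_id, stop_event, interval, samples):
    """Record the target thread's call stack every `interval` seconds of wall-clock time."""
    while not stop_event.wait(interval):
        frame = sys._current_frames().get(thread_id)  # CPython-specific
        if frame is not None:
            # One (function, line) pair per frame, outermost first.
            stack = traceback.extract_stack(frame)
            samples.append(tuple((f.name, f.lineno) for f in stack))


def profile(func, interval=0.005):
    """Run func() while a sampler thread watches the calling thread."""
    samples = []
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), stop, interval, samples),
    )
    sampler.start()
    try:
        func()
    finally:
        stop.set()
        sampler.join()
    return samples


def on_stack_percent(samples):
    """Percent of samples in which each (function, line) appears at least once."""
    counts = collections.Counter()
    for stack in samples:
        counts.update(set(stack))  # set(): a recursive frame counts once per sample
    return {site: 100.0 * n / len(samples) for site, n in counts.items()}


def wait_for_io():
    time.sleep(0.3)  # stands in for blocking I/O: no CPU use, but wall-clock time passes


def compute():
    return sum(i * i for i in range(200_000))


def workload():
    wait_for_io()
    compute()


if __name__ == "__main__":
    percents = on_stack_percent(profile(workload))
    for (name, line), pct in sorted(percents.items(), key=lambda kv: -kv[1])[:8]:
        print(f"{pct:5.1f}%  {name}:{line}")
```

Because sampling is driven by wall-clock time, the line in `workload` that calls `wait_for_io` dominates the report even though it burns almost no CPU, which a CPU-time-only sampler would miss.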
