How to get frame-pointer perf call stacks/flamegraphs involving the C++ standard library?

Question

I like the fp method for collecting call stacks with perf record, since it's lightweight and less complex than dwarf. However, the call stacks/flamegraphs I get when a program uses the C++ standard library are not correct.

Here is a test program:

#include <algorithm>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int __attribute__((noinline)) stupid_factorial(int x) {
    std::vector<std::string> xs;
    // Need to convert numbers to strings or it will all get inlined
    for (int i = 0; i < x; ++i) {
        std::stringstream ss;
        ss << std::setw(4) << std::setfill('0') << i;
        xs.push_back(ss.str());
    }
    int res = 1;
    while(std::next_permutation(xs.begin(), xs.end())) {
        res += 1;
    };
    return res;
}

int main() {
    std::cout << stupid_factorial(11) << "\n";
}

And here is the flame graph:

It was generated by the following steps on Ubuntu 20.04 in a Docker container:

g++ -Wall -O3 -g -fno-omit-frame-pointer program.cpp -o 6_stl.bin
# Make sure you have libc6-prof and libstdc++6-9-dbg installed
env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp -- ./6_stl.bin
# Make sure you have https://github.com/jonhoo/inferno installed
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
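
For context, the collapse step in that pipeline boils down to roughly the following. This is only a sketch with hypothetical names (Stack, fold_stacks), not inferno's actual code: it assumes each sample is already parsed into a list of function names, leaf first, the way perf script prints them. Real collapse tools do more, such as prefixing the process name.

```cpp
#include <map>
#include <string>
#include <vector>

// One sampled call stack, innermost (leaf) frame first, as perf script prints it.
using Stack = std::vector<std::string>;

// Hypothetical helper: turn samples into "folded" lines, root first and
// joined by ';', mapped to the number of times each stack was seen.
std::map<std::string, int> fold_stacks(const std::vector<Stack>& samples) {
    std::map<std::string, int> folded;
    for (const Stack& s : samples) {
        std::string key;
        // Walk from the outermost frame to the leaf.
        for (auto it = s.rbegin(); it != s.rend(); ++it) {
            if (!key.empty()) key += ';';
            key += *it;
        }
        ++folded[key];
    }
    return folded;
}
```

A flamegraph renderer only sees these folded stacks, so a frame that is missing from the recorded call chain is missing from the picture as well.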

The main thing that's wrong with this is that not all functions are children of stupid_factorial, e.g. __memcmp_avx2_movbe. With dwarf, they are. In more complex programs, I have even seen functions like these appear outside main. __dynamic_cast, for instance, is one that often has no parent.

In gdb, I always see correct backtraces, including for the functions that do not appear correctly here. Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?

There are also other oddities, though I couldn't reproduce them in Ubuntu 18.04 (outside the Docker container):

There is an unresolved function in libstdc++.so.6.28.
There is an unresolved function in my own binary, 6_stl.bin, on the very left. This is also the case with dwarf.

Answer 1

With your code on Ubuntu 20.04 x86_64, using perf record --call-graph fp both with and without -e cycles:u, I get a similar flamegraph when viewed with https://speedscope.app (prepare the data with perf script > out.txt and select out.txt in the web app).

"Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?"

No. The 'fp' call-graph method is implemented in the Linux kernel in a very simple way: https://elixir.bootlin.com/linux/v5.4/C/ident/perf_callchain_user - https://elixir.bootlin.com/linux/v5.4/source/arch/x86/events/core.c#L2464

perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
{ 
    ...
    fp = (unsigned long __user *)regs->bp;
    perf_callchain_store(entry, regs->ip);
    ...
    // where max_stack is probably around 127 = PERF_MAX_STACK_DEPTH     https://elixir.bootlin.com/linux/v5.4/source/include/uapi/linux/perf_event.h#L1021
    while (entry->nr < entry->max_stack) {
        ...
        if (!valid_user_frame(fp, sizeof(frame)))
            break;
        bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp));
        bytes = __copy_from_user_nmi(&frame.return_address, fp + 1, sizeof(*fp));

        perf_callchain_store(entry, frame.return_address);
        fp = (void __user *)frame.next_frame;
    }
}

This walk can't find correct frames for code compiled with -fomit-frame-pointer, because such code does not maintain the chain of saved frame pointers that the loop follows.
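
To make the loop above concrete, here is a minimal user-space sketch of the same walk. It is a model, not the kernel code: Memory and walk_frame_pointers are hypothetical names, user memory is faked with a map from address to 64-bit word, and a failed lookup stands in for the valid_user_frame() check.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical model of user memory: address -> 64-bit word.
using Memory = std::unordered_map<std::uint64_t, std::uint64_t>;

// Record ip, then repeatedly read {saved frame pointer, return address}
// at fp and fp + 8, the same pair the kernel loop copies from user space.
std::vector<std::uint64_t> walk_frame_pointers(const Memory& mem,
                                               std::uint64_t ip,
                                               std::uint64_t fp,
                                               std::size_t max_stack = 127) {
    std::vector<std::uint64_t> chain{ip};
    while (chain.size() < max_stack) {
        auto next = mem.find(fp);      // caller's saved frame pointer
        auto ret = mem.find(fp + 8);   // return address stored just above it
        if (next == mem.end() || ret == mem.end())
            break;                     // unreadable frame: stop walking
        chain.push_back(ret->second);
        fp = next->second;
    }
    return chain;
}
```

Note that the walk never looks at anything except the bp register and the words it points to. If a function leaves bp untouched or uses it as a general-purpose register, the chain the walk reports is simply whatever those words happen to contain.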

For incorrect call stacks such as main -> __memcmp_avx2_movbe, the perf.data file contains only the call chain generated by the kernel, with no copy of the user stack fragment and no register data:

setarch x86_64 -R env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp  -- ./6_stl.bin
perf script -D | less

869122666352078 0xae0 [0x58]: PERF_RECORD_SAMPLE(IP, 0x4002): 12267/12267: 0x7ffff7d51670 period: 2332683 addr: 0
... FP chain: nr:5
.....  0: fffffffffffffe00
.....  1: 00007ffff7d51670
.....  2: 0000555555556452
.....  3: 00007ffff7be90fb
.....  4: 00005555555564de
 ... thread: 6_stl.bin:12267
 ...... dso: /usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so
6_stl.bin 12267 869122.666352:    2332683 cycles: 
            7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)
            7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)

So with this method the user-space perf tool has no additional information with which to fix the call stack. With the dwarf method, every sample event carries the registers and a partial dump of the user stack.

gdb has full access to the live process and can use any information: all registers, any amount of the process's stack, and additional debug info for the program and libraries. An advanced, slow backtrace in gdb is not limited by time, security, or an uninterruptible context. The Linux kernel has to record a perf sample quickly; it can't access swapped-out data, debug sections, or debug info files, and it should not do complex parsing (which could have bugs).

A debug version of libstdc++ may help (sudo apt install libstdc++6-9-dbg), but it is slow. It also did not help me recover the lost backtrace of the asm-implemented __memcmp_avx2_movbe (libc: sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S).

If you want full backtraces, I think you will need to find out how to recompile the world (or at least all libraries used by your target application). That will probably be easier with something like Gentoo, Arch, or Alpine than with Ubuntu.

If you are interested only in performance, why do you want the flamegraph? A flat profile will capture most of the performance data, and a non-ideal flamegraph can be useful too.

Answer 2

When you look at the source code for the __memcmp_avx2_movbe function, you see that it doesn't have a function prologue.

Therefore, we should expect the immediate parent frame of __memcmp_avx2_movbe to be skipped in the backtrace. The innermost frame will still be correctly identified as __memcmp_avx2_movbe from the instruction pointer, but the return address on the stack that is identified by the frame pointer will belong to the grandparent.
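
The effect can be reproduced on paper with a toy model. The sketch below is an illustration under simplifying assumptions, not real unwinding code: ToyStack, backtrace and sample_in_memcmp are made-up names, stack slots are vector indices instead of addresses, and a return address is represented by the name of the function it returns into.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stack: words[0] is a sentinel so that rbp == 0 can end the chain.
struct ToyStack {
    std::vector<std::size_t> words{0};
    std::size_t rbp = 0;
    std::vector<std::string> names;  // return "address" k returns into names[k]

    // CALL pushes a return address pointing back into the caller.
    void call(const std::string& caller) {
        names.push_back(caller);
        words.push_back(names.size() - 1);
    }
    // Standard prologue: push rbp; mov rbp, rsp.
    void prologue() {
        words.push_back(rbp);
        rbp = words.size() - 1;
    }
};

// Frame-pointer walk: innermost frame from the instruction pointer,
// every further frame from the return address next to a saved rbp.
std::vector<std::string> backtrace(const ToyStack& s, const std::string& current) {
    std::vector<std::string> out{current};
    for (std::size_t fp = s.rbp; fp != 0; fp = s.words[fp])
        out.push_back(s.names[s.words[fp - 1]]);
    return out;
}

// Sample taken while executing inside __memcmp_avx2_movbe.
std::vector<std::string> sample_in_memcmp(bool memcmp_has_prologue) {
    ToyStack s;
    s.call("_start");           s.prologue();  // now in main
    s.call("main");             s.prologue();  // now in stupid_factorial
    s.call("stupid_factorial");                // now in __memcmp_avx2_movbe
    if (memcmp_has_prologue) s.prologue();
    return backtrace(s, "__memcmp_avx2_movbe");
}
```

With memcmp_has_prologue set to false, rbp still points at the frame record of stupid_factorial, whose neighbouring return address leads into main, so the walk reports __memcmp_avx2_movbe, main, _start and drops stupid_factorial. With a prologue, all four frames appear.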

When the stupid_factorial function is the parent of __memcmp_avx2_movbe (because all intermediate functions between those two are inlined), that could explain the primary issue from the question. The other issues are resolved by using a libstdc++ compiled with frame pointers, as described here.


Answer 1 (score 2, accepted), 2021-07-06 13:04:15

Answer 2 (score 1), 2021-07-16 16:46:46
