Emulating `perf record -g` with `perf_event_open`

Question

My goal is to write some code that records the current call stack for all CPUs at some interval. Essentially, I would like to do the same thing as perf record, but by calling perf_event_open myself.

According to the manpage, it seems I need to use the PERF_SAMPLE_CALLCHAIN sample type and read the results with mmap. That said, the manpage is incredibly terse, and some sample code would go a long way right now.

Can someone point me in the right direction?

Answer 1

The best way to learn about this is to read the Linux kernel source code and see how you can emulate perf record -g yourself.

As you correctly identified, recording perf events starts with the perf_event_open system call, so that is where we can start:

Definition of perf_event_open

If you look at the parameters of the system call, you will see that the first one is of type struct perf_event_attr *. This parameter carries the attributes of the event, and it is what you need to modify to record callchains. Sample code could look like this (remember that you can tweak the other parameters and the members of struct perf_event_attr as you like):

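     #define _GNU_SOURCE
     #include <errno.h>
     #include <linux/perf_event.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include <sys/ioctl.h>
     #include <sys/mman.h>
     #include <sys/syscall.h>
     #include <unistd.h>

     /* glibc provides no wrapper for perf_event_open, so go through syscall(2) */
     static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                 int cpu, int group_fd, unsigned long flags)
     {
       return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
     }

     /* 2^8 data pages; lower this if mmap fails because of the locked-memory limit */
     static int buf_size_shift = 8;

     /* one metadata page followed by 2^shift data pages */
     static size_t perf_mmap_size(int shift)
     {
       return ((1UL << shift) + 1) * (size_t)sysconf(_SC_PAGESIZE);
     }

     int main(void)
     {
       struct perf_event_attr pe;
       long long count = 0;
       int fd;

       memset(&pe, 0, sizeof(pe));
       pe.type = PERF_TYPE_HARDWARE;
       pe.size = sizeof(pe);
       pe.config = PERF_COUNT_HW_INSTRUCTIONS;
       pe.sample_period = 1000;                /* one sample every 1000 instructions */
       pe.sample_type = PERF_SAMPLE_CALLCHAIN; /* this is what allows you to obtain callchains */
       pe.disabled = 1;
       pe.exclude_kernel = 1;
       pe.exclude_hv = 1;

       /* pid = 0, cpu = -1: measure the calling process on whichever CPU it runs */
       fd = (int)perf_event_open(&pe, 0, -1, -1, 0);
       if (fd == -1) {
         fprintf(stderr, "Error opening leader %llx: %s\n",
                 (unsigned long long)pe.config, strerror(errno));
         exit(EXIT_FAILURE);
       }

       /* associate a ring buffer with the file descriptor */
       size_t map_size = perf_mmap_size(buf_size_shift);
       struct perf_event_mmap_page *mpage =
           mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
       if (mpage == MAP_FAILED) {
         perror("mmap");
         close(fd);
         return EXIT_FAILURE;
       }

       ioctl(fd, PERF_EVENT_IOC_RESET, 0);
       ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

       printf("Measuring instruction count for this printf\n");

       ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

       /* read(2) returns the total event count; the samples themselves are in the ring buffer */
       if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
         perror("read");
         count = -1;
       }
       printf("Used %lld instructions\n", count);

       munmap(mpage, map_size);
       close(fd);
       return 0;
     }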
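
The sample above only sets up the event; it never looks at the samples. As a rough sketch of the next step: when sample_type is exactly PERF_SAMPLE_CALLCHAIN, the perf_event_open(2) manpage describes the body of each PERF_RECORD_SAMPLE as a u64 count followed by that many u64 instruction pointers. The helper below (sample_callchain is a name made up for this illustration) copies the frames out of one such record and skips the PERF_CONTEXT_* marker values. It assumes the record has already been copied out of the ring buffer into contiguous memory.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: extract the callchain from one PERF_RECORD_SAMPLE.
 * Assumes sample_type == PERF_SAMPLE_CALLCHAIN only, so the record is
 *   struct perf_event_header; u64 nr; u64 ips[nr];
 * Returns the number of frames written to out, or -1 on a malformed record.
 */
static int sample_callchain(const void *record, uint64_t *out, size_t max_out)
{
    const unsigned char *p = record;
    struct perf_event_header hdr;
    uint64_t nr, ip;
    size_t n = 0;

    assert(record != NULL && out != NULL);

    memcpy(&hdr, p, sizeof(hdr));
    if (hdr.type != PERF_RECORD_SAMPLE)
        return -1;
    if (hdr.size < sizeof(hdr) + sizeof(nr))
        return -1;

    memcpy(&nr, p + sizeof(hdr), sizeof(nr));
    /* nr must fit inside the record as announced by the header */
    if (nr > (hdr.size - sizeof(hdr) - sizeof(nr)) / sizeof(uint64_t))
        return -1;

    p += sizeof(hdr) + sizeof(nr);
    for (uint64_t i = 0; i < nr && n < max_out; i++) {
        memcpy(&ip, p + i * sizeof(ip), sizeof(ip));
        /* values at or above PERF_CONTEXT_MAX are context markers, not addresses */
        if (ip >= (uint64_t)PERF_CONTEXT_MAX)
            continue;
        out[n++] = ip;
    }
    return (int)n;
}
```

If you add more PERF_SAMPLE_* bits to sample_type, the record body grows additional fields in the order given in the manpage, and this parser would need to skip them.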

Note: a nice and easy way to understand how all of these perf events are handled is to look at the following project:

PMU-TOOLS by Andi Kleen

If you start reading the source code of the system call, you will see that it calls the function perf_event_alloc. Among other things, this function sets up the buffers used to collect callchains, as perf record does.

The function get_callchain_buffers is responsible for setting up the callchain buffers.

In sampling mode, perf_event_open works as follows: whenever the performance monitoring counter for the event you are profiling overflows, the kernel collects all the relevant event information and stores it in a ring buffer. This ring buffer is set up and accessed via mmap(2).
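
To make the ring-buffer part more concrete, here is a minimal sketch of walking the records between the consumer position (tail) and the producer position (head). The function names ring_copy, ring_drain and count_records are invented for this example. It assumes the data area has a power-of-two size (as it does when you map 1 + 2^n pages) and that positions are free-running byte counters, so a record may wrap around the end of the buffer and has to be copied out in two pieces.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes starting at logical position pos, handling wrap-around. */
static void ring_copy(void *dst, const unsigned char *data, uint64_t size,
                      uint64_t pos, size_t len)
{
    uint64_t off = pos & (size - 1);
    size_t first = len < size - off ? len : (size_t)(size - off);

    memcpy(dst, data + off, first);
    memcpy((unsigned char *)dst + first, data, len - first);
}

typedef void (*record_fn)(const struct perf_event_header *rec, void *ctx);

/* Example callback: counts how many records were seen. */
static void count_records(const struct perf_event_header *rec, void *ctx)
{
    (void)rec;
    ++*(unsigned *)ctx;
}

/*
 * Call fn on every complete record in [tail, head) and return the new tail.
 * header.size is a 16-bit field, so a 64 KiB scratch buffer holds any record.
 */
static uint64_t ring_drain(const unsigned char *data, uint64_t size,
                           uint64_t head, uint64_t tail,
                           record_fn fn, void *ctx)
{
    union {
        struct perf_event_header hdr;
        unsigned char bytes[65536];
    } rec;
    struct perf_event_header hdr;

    assert(size != 0 && (size & (size - 1)) == 0); /* power of two */

    while (head - tail >= sizeof(hdr)) {
        ring_copy(&hdr, data, size, tail, sizeof(hdr));
        if (hdr.size < sizeof(hdr) || hdr.size > head - tail)
            break; /* corrupt or not yet fully written */
        ring_copy(rec.bytes, data, size, tail, hdr.size);
        fn(&rec.hdr, ctx);
        tail += hdr.size;
    }
    return tail;
}
```

With a real mapping, data would point just past the metadata page, head would be read from data_head in struct perf_event_mmap_page with acquire semantics, and the returned tail would be stored back to data_tail so the kernel can reuse the space. That glue is left out here so the walking logic can be exercised on an ordinary byte array.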

Edit #1:

The flowchart describing the use of mmap during perf record is shown in the image below.

In perf record, the mmap-ing of the ring buffers starts in __cmd_record, the first function that runs when you invoke the command. It calls record__open, which calls record__mmap, followed by record__mmap_evlist, which calls perf_evlist__mmap_ex. That is followed by perf_evlist__mmap_per_cpu, and the chain finally ends in perf_evlist__mmap_per_evsel, which does most of the heavy lifting as far as doing an mmap for each event is concerned.
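
Since the question asks about all CPUs, here is a hedged sketch of the per-CPU part: one event is opened per CPU with pid = -1, and each returned descriptor would then get its own mmap'ed ring buffer, in the spirit of the per-CPU mapping that perf does. The function name open_on_each_cpu, the cpu-clock software event and the 99 Hz frequency are choices made for this illustration, not what perf record itself uses. It also assumes that online CPUs are numbered 0..N-1 without gaps, and system-wide events normally need elevated privileges or a permissive perf_event_paranoid setting, so the call may simply fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open one sampling event on every online CPU.
 * Returns the number of descriptors stored in fds, or -1 if any open fails
 * (in which case everything opened so far is closed again).
 */
static int open_on_each_cpu(int *fds, int max_fds)
{
    struct perf_event_attr attr;
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    int n = 0;

    assert(fds != NULL && max_fds > 0);

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;         /* assumption: timer-based sampling */
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.freq = 1;                          /* interpret sample_freq, not sample_period */
    attr.sample_freq = 99;                  /* assumption: about 99 samples per second */
    attr.sample_type = PERF_SAMPLE_CALLCHAIN;
    attr.disabled = 1;                      /* enable later with PERF_EVENT_IOC_ENABLE */

    for (long cpu = 0; cpu < ncpu && n < max_fds; cpu++) {
        /* pid = -1, cpu = N: every task running on CPU N */
        int fd = (int)syscall(SYS_perf_event_open, &attr, (pid_t)-1,
                              (int)cpu, -1, 0UL);
        if (fd < 0) {
            while (n > 0)
                close(fds[--n]);
            return -1;
        }
        fds[n++] = fd;
    }
    return n;
}
```

Each descriptor can then be mapped and enabled exactly as in the single-event sample, with one ring buffer per CPU to drain.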

Edit #2:

Yes, you are correct. When you set the sample period to, say, 1000, the kernel records a sample into this buffer on every 1000th occurrence of the event (cycles by default for perf record; instructions in the sample code above). In effect, the hardware counter is programmed so that it overflows after 1000 more events. On x86, for example, it is preloaded with the negated period rather than with 1000 itself. The overflow raises an interrupt, and the sample is recorded from the interrupt handler.
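
The overflow arithmetic can be illustrated with a few lines of C. This is only a worked example, not kernel code: hardware counters count upwards, so one way to get an overflow after a given number of events is to preload the counter with 2^width minus the period. The function names and the 48-bit width used in the usage note are assumptions for the illustration; real counter widths vary by CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Bit mask covering a counter that is `width` bits wide. */
static uint64_t counter_mask(unsigned width)
{
    return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/* Value to load so that the counter wraps to 0 after `period` events. */
static uint64_t counter_preload(uint64_t period, unsigned width)
{
    uint64_t mask = counter_mask(width);

    assert(period > 0 && period <= mask);
    return (0 - period) & mask;
}

/* Number of increments before a counter holding `value` wraps to 0. */
static uint64_t events_until_overflow(uint64_t value, unsigned width)
{
    return (counter_mask(width) - value) + 1;
}
```

For a period of 1000 on a 48-bit counter, counter_preload returns 0xFFFFFFFFFC18, and events_until_overflow on that value gives back 1000.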

Solution 1
4 Accepted 2018-03-08 04:09:33