
How does perf associate events to functions?

More precisely: how does the perf tool associate PMU events with functions? I have already realized that when the kernel perf subsystem records the event counters, it also records the program counter (PC) so that it can attribute the count to a function.

However, to get really fine-grained results you need to sample the counters at a very high rate; otherwise a count may end up attributed to a group of functions rather than to a single one. But reading the counters and writing the sampled data (counters, PC, call stack) to the perf mmap buffer is very intrusive.

I read in some sources that this sampling only happens when the PMU counters overflow, but this can be very coarse unless I set the counters to overflow very quickly.

What am I missing here?

perf record is a statistical profiling tool. It either programs the hardware performance monitoring unit (PMU) to overflow after some number of events (for example, with -e cycles -c 1000000 it writes -1000000 to the counter and enables cycle counting; with -F, or with no frequency/period argument, it autotunes the period), and on each overflow interrupt perf reprograms the counter for the next period, so you typically get several hundred to a few thousand samples per second. Or it uses the OS timer interrupt (-e task-clock) to get periodic samples. On every sample (that is, on every interrupt from the hardware PMU or the timer) perf records the current PC (EIP) and/or the call stack; it does not record the current value of the counter. You can check this in the full dump of the data stored in perf.data with perf script or perf script -D, or in the code that dumps sample events: there is sample->ip but no current PMU count.
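The overflow mechanism described above can be illustrated with a small simulation. This is only a sketch under assumptions, not how the kernel implements it: the 48-bit counter width, the function names, and the idea of feeding one PC per counted event are all made up for illustration.

```python
COUNTER_BITS = 48  # assumed counter width; real PMUs differ


def simulate_sampling(pc_stream, period):
    """Yield the PC observed at each simulated counter overflow.

    pc_stream: iterable of program-counter values, one per counted event.
    period:    number of events between samples (the role of -c <period>).
    """
    wrap = 1 << COUNTER_BITS
    counter = (-period) % wrap          # preload the counter with -period
    for pc in pc_stream:
        counter = (counter + 1) % wrap  # one counted event
        if counter == 0:                # wrapped to zero: overflow "interrupt"
            yield pc                    # only the PC is recorded, not the count
            counter = (-period) % wrap  # re-arm for the next period


def samples_per_second(event_rate_hz, period):
    """Expected sample rate for a given event rate and sampling period."""
    return event_rate_hz / period


if __name__ == "__main__":
    # Ten events with a period of 3: samples land on the 3rd, 6th and 9th event.
    print(list(simulate_sampling(range(10), 3)))
    # A core retiring 3e9 cycles/s with -c 1000000 gives about 3000 samples/s.
    print(samples_per_second(3e9, 1_000_000))
```

The second function shows why a period of one million cycles is not coarse in practice: on a core running at a few GHz it still produces thousands of samples per second.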

perf report parses perf.data to get all the PCs recorded in it. It counts how many times each PC was sampled to build a histogram [PC] -> sample_count. Every PC is associated with the exact function it belongs to (mmap events are recorded in perf.data too, so perf report parses the memory map, opens every binary used, and looks up the symbol table of each binary).
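A minimal sketch of that post-processing step follows: resolve each sampled PC to a symbol, then count samples per symbol. The symbol table, addresses, and function names here are hypothetical, and the step where perf translates a runtime address into a file offset through the recorded mmap events is left out.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start_address, size, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x40, "main"),
    (0x1040, 0x100, "compute"),
    (0x1140, 0x20, "helper"),
]
STARTS = [start for start, _size, _name in SYMBOLS]


def resolve(pc):
    """Return the name of the symbol whose address range contains pc."""
    i = bisect_right(STARTS, pc) - 1   # last symbol starting at or before pc
    if i < 0:
        return "[unknown]"
    start, size, name = SYMBOLS[i]
    return name if pc < start + size else "[unknown]"


def build_histogram(sampled_pcs):
    """Count samples per function: the symbol -> sample_count histogram."""
    return Counter(resolve(pc) for pc in sampled_pcs)


if __name__ == "__main__":
    samples = [0x1000, 0x1050, 0x1051, 0x113F, 0x1150, 0x2000]
    hist = build_histogram(samples)
    total = sum(hist.values())
    for name, count in hist.most_common():
        print(f"{100.0 * count / total:6.2f}%  {name}")
```

Because each sample carries its own PC, the attribution to a function is exact per sample; only the number of samples limits how fine-grained the resulting percentages are.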

The actual code of perf report is in linux/tools/perf/builtin-report.c: cmd_report / __cmd_report -> perf_session__process_events -> some magic -> process_sample_event, which adds every ip (PC) value found in perf.data to the histogram with hist_entry_iter__add(&iter, &al, rep->max_stack, rep); via hist_iter__report_callback:

hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
. . . (perf/util/annotate.c) __symbol__inc_addr_samples
        h->addr[offset]++;
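The h->addr[offset]++ line amounts to a per-symbol array of counters indexed by the offset of the sampled address within the function. The sketch below imitates that idea in Python; the class and method names are illustrative only and are not perf's own data structures.

```python
class SymbolAnnotation:
    """Per-address sample counts inside one function (illustrative only)."""

    def __init__(self, name, start, size):
        self.name = name
        self.start = start
        self.addr = [0] * size          # one bucket per byte offset

    def inc_addr_samples(self, pc):
        """Bump the bucket for pc; return False if pc is outside the symbol."""
        offset = pc - self.start
        if not 0 <= offset < len(self.addr):
            return False
        self.addr[offset] += 1          # the analogue of h->addr[offset]++
        return True

    def hottest(self, n=3):
        """Return up to n (offset, count) pairs, most-sampled first."""
        hits = [(off, c) for off, c in enumerate(self.addr) if c]
        hits.sort(key=lambda t: (-t[1], t[0]))
        return hits[:n]


if __name__ == "__main__":
    ann = SymbolAnnotation("compute", 0x1040, 0x100)
    for pc in (0x1044, 0x1044, 0x1050, 0x1044):
        ann.inc_addr_samples(pc)
    for offset, count in ann.hottest():
        print(f"{ann.name}+{offset:#x}: {count} samples")
```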

It then outputs the collected histogram with report__browse_hists -> perf_evlist__tty_browse_hists -> hists__fprintf_nr_sample_events(hists, rep, evname, stdout);.

Every sample is already associated with an exact function (and with a slightly inexact instruction inside it, because of the out-of-order nature of CPUs and the imprecise PMU overflow event); this is how statistical profiling works. When your program runs for a short time (less than a second) and/or the sampling frequency is too low, only a few samples may be recorded in perf.data. But if you have more than a few hundred samples, you can find the most CPU-heavy functions (they probably follow the Pareto rule and each account for some tens of percent of the program's run time). When you want to see smaller functions (around a few percent of the running time), use thousands or tens of thousands of samples and do some statistical estimation (you will not get a correct percentage for a function that runs for 0.1% of the time when you have only 100 or 1000 samples).
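One rough way to do such an estimate is to treat each sample as an independent draw and use the normal approximation to the binomial. That independence is an assumption (real samples are correlated in time), and the helper names below are made up for illustration.

```python
import math


def share_estimate(hits, total, z=1.96):
    """Return (share, low, high): a function's estimated time share with an
    approximate 95% interval, assuming independent samples."""
    p = hits / total
    stderr = math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - z * stderr), min(1.0, p + z * stderr)


def samples_needed(p, rel_error, z=1.96):
    """Samples required so the interval half-width is at most rel_error * p."""
    return math.ceil(z * z * (1.0 - p) / (p * rel_error ** 2))


if __name__ == "__main__":
    # One hit in 1000 samples: the interval spans from 0 to about 3x the estimate.
    print(share_estimate(1, 1000))
    # A 30% function is pinned down to +/-10% (relative) with under 1000 samples,
    print(samples_needed(0.30, 0.10))
    # while a 0.1% function needs hundreds of thousands.
    print(samples_needed(0.001, 0.10))
```

Under these assumptions, a function that takes 0.1% of the time is expected to receive about one hit in 1000 samples, so its reported percentage is mostly noise, which matches the warning above.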
