How should I interpret OProfile output?

Question

I tried profiling my application with OProfile recently. The data gathered is already very valuable to me, but I'm having difficulties with its precise interpretation. After running my app with OProfile set up and running, I generated the report and got:

root@se7xeon:src# opreport image:test -l -t 1
Overflow stats not available
CPU: P4 / Xeon with 2 hyper-threads, speed 3191.66 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 750000
samples  %        symbol name
215522   84.9954  cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const
17998     7.0979  cci::Calc::diff(unsigned char const*, unsigned char const*)
13171     5.1942  cci::Image::getIRect(unsigned char*, int, int) const
5519      2.1765  cci::Image::getFRect(unsigned char*, double, double) const

Okay, so my interpolation function is responsible for 84% of the application's (too long) execution time. Seems a good idea to look into it then:

root@se7xeon:src# opannotate image:test --source
[...]

/* cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const total: 215522   84.9954 */
  1392  0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()
     4  0.0016 :{
[...]
               :                col0 = static_cast<int>(point[idx].x);
     3  0.0012 :                col1 = col0+1;
   629  0.2498 :                row0 = static_cast<int>(point[idx].y);
   385  0.1529 :                row1 = row0+1;
 56214 22.3266 :                if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)
               :                {
               :                        col0 = row0 = col1 = row1 = 0;
               :                }

If I understand correctly, the if conditional is responsible for over 22% of the program's execution time. The opening brace and the function declaration seem to take time; is that supposed to correspond to the function call overhead ("push parameters on stack, jump, pop parameters" sequence)?

I changed some things in the source (related to a later bottleneck, because I had no idea how to optimize an if), recompiled, and ran it through oprofile again (not forgetting opcontrol --reset). Now the annotated code looks like this in the same place:

     6  0.0024 :                curPx = point[idx].x;
   628  0.2477 :                curPy = point[idx].y;
   410  0.1617 :                col0 = static_cast<int>(curPx);
 57910 22.8380 :                col1 = col0+1;
               :                row0 = static_cast<int>(curPy);
               :                row1 = row0+1;
               :                if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)
               :                {
               :                        col0 = row0 = col1 = row1 = 0;
               :                }

This time the if takes basically no time at all (?), the most expensive instruction is "col1 = col0 + 1", and the whole time-taking block seems to have shifted upwards. How can this be? Can this be trusted at all to pinpoint bottlenecks in the source?

Another point of doubt for me is that when I set up opcontrol, I entered the traced event as GLOBAL_POWER_EVENTS, with the number of samples being 750k. In the output, the interpolation function seems to take 84%, but the number of samples recorded inside it is only a little bit above 200k. That isn't even 50% of the requested number. Am I to understand that the remaining ~500k samples were taken by applications not listed in the output (kernel, Xorg, etc.)?
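A note on the arithmetic in the last paragraph: in OProfile's event specification the count (750000 here) is, as far as I know, the number of events between two samples, not a number of samples requested. Under that reading the 215522 samples are not a fraction of 750k, and the percentage column appears to be relative to the samples attributed to the selected image. The sketch below redoes the arithmetic. The function names are made up for illustration, and it assumes GLOBAL_POWER_EVENTS ticks roughly once per clock cycle of the reported 3191.66 MHz clock, which is an approximation (hyper-threading and the "estimated" clock speed are ignored).

```cpp
#include <cassert>
#include <cstdint>

// Approximate number of counted events behind a number of samples,
// assuming one sample is recorded every `eventsPerSample` events.
double approxEvents(std::uint64_t samples, std::uint64_t eventsPerSample) {
    return static_cast<double>(samples) * static_cast<double>(eventsPerSample);
}

// Approximate unhalted time in seconds.
// Assumption: the counted event ticks about once per clock cycle.
double approxSeconds(std::uint64_t samples, std::uint64_t eventsPerSample,
                     double clockHz) {
    assert(clockHz > 0.0);
    return approxEvents(samples, eventsPerSample) / clockHz;
}

// Total number of samples implied by one symbol's sample count and its
// percentage in the report.
double impliedTotalSamples(std::uint64_t samples, double percent) {
    assert(percent > 0.0 && percent <= 100.0);
    return static_cast<double>(samples) * 100.0 / percent;
}
```

With the numbers from the report, approxSeconds(215522, 750000, 3191.66e6) comes to roughly 50.6 seconds inside interpolate, and impliedTotalSamples(215522, 84.9954) to roughly 253.6 thousand samples for the whole image, which is consistent with the four listed symbols summing to about 252 thousand.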

Answer 1

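When profiling optimized code you really cannot rely on accurate source code lines. The compiler moves stuff around far too much. Samples are recorded against instruction addresses, and once the optimizer has reordered and merged instructions, those addresses map back to source lines only approximately.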

For an accurate picture you will need to look at the disassembly, for example the annotated assembly that opannotate --assembly produces.

Answer 2

OProfile can (they tell me) get stack samples on wall-clock time (not CPU), and it can give you line-level percentages. What you are looking for is lines that are contained on a large percent of stack samples.

I wouldn't turn on compiler optimization until after I finished hand-tuning the code, because it just hides things.

When you say the interpolate routine uses 84% of the time, that triggers a question. The entire program takes some total time, right? It takes 100% of that time. If you cut the program's time in half, or if you double it, it will still take 100% of the time. Whether 84% for interpolation is too much or not depends on whether it is being done more than necessary.

So I would suggest that you not ask if the percent of a routine is too much. Rather, look for lines of code that take a significant amount of time and ask if they could be optimized. See the difference? After you optimize the code, it can make a large reduction in overall run time, but it might still be a large percent of a smaller total. Code isn't optimal when nothing takes a large percent. Code is optimal when, of all the things that take a large percent, none can be improved.
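To put numbers on "a large percent of a smaller total", here is a minimal sketch. The function names are hypothetical, and it assumes the rest of the program is unaffected by the change, which real code may not honor.

```cpp
#include <cassert>

// Run time after speeding up one part by `speedup`, relative to an
// original total of 1.0. `fraction` is that part's share before the change.
double relativeTotalAfter(double fraction, double speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(speedup > 0.0);
    return (1.0 - fraction) + fraction / speedup;
}

// Share of the new, smaller total that the sped-up part still takes.
double shareAfter(double fraction, double speedup) {
    return (fraction / speedup) / relativeTotalAfter(fraction, speedup);
}
```

For a routine at 84%, making it twice as fast gives relativeTotalAfter(0.84, 2.0) = 0.58, so the program runs in 58% of the original time, yet shareAfter(0.84, 2.0) is still about 0.72. The percentage barely moved even though the total dropped by 42%.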

I don't care for things that just give numbers. What I want is insight. For example, if that routine accounts for 84% of the time, then if you took 10 samples of the stack, it would be on about 8.4 of them on average. The exact number doesn't matter. What matters is to understand why it was in there. Was it really necessary to be in there so much? That's what looking at the stack samples can tell you. Maybe you're actually doing the interpolation twice as often as necessary? Often people find out, by analyzing the why, that the routine they're trying to speed up didn't need to be called nearly as much, maybe not at all. I can't guess in your case. Only the insight from examining the program's state can tell you that.
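The "10 samples" argument can be made a bit more concrete with a simple binomial model. This is only a sketch: it assumes the stack samples are independent and taken at random times, and probAtLeast is a name invented for this example.

```cpp
#include <cassert>
#include <cmath>

// Probability that a routine which is on the stack for `fraction` of the
// run time appears on at least `minHits` out of `samples` stack samples.
// Assumption: samples are independent (binomial model).
double probAtLeast(double fraction, int samples, int minHits) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(samples >= 0);
    if (minHits <= 0) return 1.0;
    if (minHits > samples) return 0.0;

    double total = 0.0;
    for (int k = minHits; k <= samples; ++k) {
        // Binomial coefficient C(samples, k), built up as a product.
        double coeff = 1.0;
        for (int i = 1; i <= k; ++i) {
            coeff = coeff * (samples - k + i) / i;
        }
        total += coeff * std::pow(fraction, k)
                       * std::pow(1.0 - fraction, samples - k);
    }
    return total;
}
```

For a routine at 84%, probAtLeast(0.84, 10, 5) comes out above 0.99, so even ten samples are enough to notice it. That is why the exact percentage matters less than reading what the program was doing in each sample.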

2 answers

Answer 1: 3, accepted, 2010-10-27 15:01:54

Answer 2: 2, 2010-10-27 20:56:44
