Does perf stat -C X <command> run the command on core X?

Question

I want to profile a command, say ls, on a single core. I can use the -C flag of perf stat to specify which core to profile, but does ls actually run on the core I choose here?

Attempting perf stat -C 7 ls, I get wildly different cycle counts, ranging from 150k to 5 million.

I can force ls to run on a specific core with taskset, e.g. perf stat -C 7 -A taskset --cpu-list 7 ls, but I still get wildly different cycle counts on each run, although there does seem to be less variation (2 to 6 million cycles). Of course, taskset would have some overhead here. Is this the correct way to obtain results that are as accurate as possible?

Answer 1

No, it doesn't. It counts events on that CPU whether or not any threads from your task happen to be running on it!

You can use taskset -c 7 perf stat ... if you don't mind perf itself also running on that CPU core; that avoids profiling taskset. perf stat has very little overhead, if any, so it's not a problem that it's on the same core as the workload while it's counting.
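As a rough sketch of that suggestion: the wrapper below pins perf stat to one core, and the profiled command inherits that affinity when perf launches it. The function name pinned_stat and the CPU variable are made up for illustration, and the default core 0 is only an assumption (the question used core 7). The event list is the user-space-only pair mentioned elsewhere in this answer.

```shell
# Hypothetical helper, not part of perf: pin perf stat and its child to one core.
CPU=${CPU:-0}   # assumed core number; the question used 7

pinned_stat() {
    # taskset sets the CPU affinity of perf itself; the command that perf
    # forks and execs inherits it, so no -C option is needed.
    taskset -c "$CPU" perf stat -e cycles:u,instructions:u -- "$@"
}

# Only attempt a run when both tools are installed.
if command -v taskset >/dev/null 2>&1 && command -v perf >/dev/null 2>&1; then
    pinned_stat ls >/dev/null || echo "perf stat failed (permissions or affinity?)" >&2
fi
```

Because there is no -C here, perf counts only the task it launched, not everything that happens to run on that core.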

perf stat -C doesn't imply -a according to the man page, so it's surprising you don't get zero counts more of the time (with the process being profiled not running on the selected CPU core at all).

/bin/ls is a very short-lived workload that spends most of its time in system calls, so it's a weird choice of something to profile. 4 million cycles is only 1 millisecond on a 4GHz CPU. And much of it is probably spent in kernel code for getdents, so you'd expect high variability anyway if you aren't using --all-user or -e instructions:u,cycles:u and so on.
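To make the cycles-to-time arithmetic concrete, here is a small sketch that converts a cycle count to milliseconds at a given clock frequency. The helper name cycles_to_ms is invented for this example, and it assumes the core runs at a constant frequency, which real CPUs with turbo and frequency scaling do not guarantee.

```shell
# Hypothetical helper: convert a cycle count to milliseconds.
# $1 = cycle count, $2 = clock frequency in GHz (assumed constant).
cycles_to_ms() {
    LC_ALL=C awk -v c="$1" -v ghz="$2" \
        'BEGIN { printf "%.3f ms\n", c / (ghz * 1e9) * 1e3 }'
}

cycles_to_ms 4000000 4   # prints "1.000 ms"
```

At that scale, a single timer interrupt or page fault is a noticeable fraction of the whole measurement, which is part of why such a short workload shows high run-to-run variation.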

A normal run of a simple workload that uses some CPU time looks like this, on an i7-6700k with Linux 5.16:

$ taskset -c 4 perf stat --all-user awk 'BEGIN{for(i=0;i<10000000;i++){}}'

 Performance counter stats for 'awk BEGIN{for(i=0;i<10000000;i++){}}':

            331.11 msec task-clock                #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               177      page-faults               #  534.559 /sec                   
     1,371,512,156      cycles                    #    4.142 GHz                    
     3,582,591,466      instructions              #    2.61  insn per cycle         
       970,439,895      branches                  #    2.931 G/sec                  
            22,558      branch-misses             #    0.00% of all branches        

       0.331526126 seconds time elapsed

       0.328034000 seconds user
       0.003313000 seconds sys

But 10 back-to-back runs counting user-space only, for a CPU other than the one the workload is pinned to, count wildly varying numbers of instructions and cycles. (Note the variance of well over 100%.) I'm not sure exactly what instructions it could be counting; like I said, I expected this to be zero.

$ taskset -c 4 perf stat --all-user -r10 -C 3 awk 'BEGIN{for(i=0;i<10000000;i++){}}'

 Performance counter stats for 'CPU(s) 3' (10 runs):

            329.45 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.06% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 0      page-faults               #    0.000 /sec                   
         2,692,718      cycles                    #    0.008 GHz                      ( +-124.70% )
         1,875,435      instructions              #    0.24  insn per cycle           ( +-241.60% )
           358,646      branches                  #    1.088 M/sec                    ( +-254.22% )
            12,917      branch-misses             #    0.68% of all branches          ( +- 70.77% )

          0.329648 +- 0.000198 seconds time elapsed  ( +-  0.06% )

Instructions varied from 139k to 3767k over a few runs, and the IPC was not always the same: sometimes around 1.0, but in many other runs 0.25 +- 0.05.
