Measuring amount of CPU time taken by a piece of code, in C on Unix/Linux

Can clock() be used as a dependable API to measure the CPU time taken to execute a snippet of code? When verified using times() / clock(), neither seems to measure the CPU time precisely.

First, can clock()/times() be used to measure the execution time of a function or snippet of code, as in the example below? Is there a better, more dependable alternative? The mechanism has to work on Linux, HP-UX, IBM AIX and Sun Solaris, as we need to measure (and compare) the performance of a piece of code on all these platforms.

Kindly suggest. Also, please let me know if I am missing anything trivial.

bbb@m_001:/tmp/kk1$ ./perf_clock 102400
{clock(): S          0 E          0 D    0.0000000000}
bbb@m_001:/tmp/kk1$ ./perf_clock 204800
{clock(): S          0 E      10000 D    0.0100000000}
bbb@m_001:/tmp/kk1$ cat perf_clock.c

#include <stdio.h>
#include <stdlib.h>   /* atoi, malloc, free */
#include <string.h>
#include <time.h>
#include <unistd.h>

void test_clock(char* sbuf, int* len){
    clock_t start, end; int i=0;
    start = clock();
    while(i++ < 500) memset((char*)sbuf,0,*len);
    end = clock();
    /* clock_t is not necessarily unsigned long, so cast for %lu */
    printf("{clock(): S %10lu E %10lu D %15.10f}\n",
        (unsigned long)start,(unsigned long)end,
        (end-start)/(double) CLOCKS_PER_SEC);
}
int main(int argc,char* argv[])
{
        if (argc < 2) return 1;   /* buffer length is required */
        int len=atoi(argv[1]);
        char *sbuf=(char*)malloc(len);
        test_clock(sbuf,&len);
        free(sbuf); return 0;
}

The results seem to say that memset()-ing a 100 KB chunk 500 times takes no time at all. Or do they say that the time is not measurable in microseconds?

Actually, it is not memset() but another function (one that prepares a huge structure of around 1 MB, mallocs a copy of it, does an Oracle DB select and populates these structures with the data from the DB) that I am trying to measure. Even this shows 0 ticks, and that is what has kept me confused.

Thanks!

On recent Linux systems (*), you can get this information from the /proc filesystem. In the file /proc/PID/stat, the 14th entry has the number of jiffies used in userland code and the 15th entry has the number of jiffies used in system code.

If you want to see the data on a per-thread basis, you should reference the file /proc/PID/task/TID/stat instead.

To convert jiffies to microseconds, you can use the following:

#include <unistd.h>   /* sysconf, _SC_CLK_TCK */

#define USEC_PER_SEC         1000000UL

long long jiffies_to_microsecond(long long jiffies)
{
    long hz = sysconf(_SC_CLK_TCK);
    if (hz <= USEC_PER_SEC && !(USEC_PER_SEC % hz))
    {
        return (USEC_PER_SEC / hz) * jiffies;
    }
    else if (hz > USEC_PER_SEC && !(hz % USEC_PER_SEC))
    {
        return (jiffies + (hz / USEC_PER_SEC) - 1) / (hz / USEC_PER_SEC);
    }
    else
    {
        return (jiffies * USEC_PER_SEC) / hz;
    }
}
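As a rough sketch of how the two counters described above might be read, the function below parses fields 14 and 15 out of a stat file. The name read_cpu_ticks and the 1024-byte buffer are assumptions made for illustration, and the field layout is the one documented in proc(5); it is not part of the original answer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads utime (field 14) and stime (field 15), both in
 * clock ticks, from a /proc stat file. Returns 0 on success, -1 on failure. */
int read_cpu_ticks(const char *path, unsigned long *utime, unsigned long *stime)
{
    char buf[1024];   /* assumed to be large enough for one stat line */
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    /* Field 2 (the command name) is parenthesised and may itself contain
     * spaces or ')', so resume parsing after the last ')'. */
    char *p = strrchr(buf, ')');
    if (p == NULL)
        return -1;

    /* Skip field 3 (state), fields 4-8 (signed) and fields 9-13 (unsigned),
     * then store fields 14 and 15. */
    int got = sscanf(p + 1,
                     " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                     utime, stime);
    return got == 2 ? 0 : -1;
}
```

Calling read_cpu_ticks("/proc/self/stat", &u, &s) before and after the code under test and passing the differences to jiffies_to_microsecond would give per-process figures; a /proc/PID/task/TID/stat path should work the same way for a single thread.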

If all you care about is per-process statistics, getrusage is easier. But if you want to be prepared to do this on a per-thread basis, this technique is better: other than the file name, the code is identical for getting the data per-process or per-thread.

* - I'm not sure exactly when the stat file was introduced. You will need to verify that your system has it.

I would give getrusage a try and check the system and user time.
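A minimal sketch of the getrusage approach, assuming a POSIX system; the helper name cpu_times is made up for this example. It reports the user and system CPU time consumed so far by the calling process.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: stores the user and system CPU time used so far by
 * this process, in seconds. Returns 0 on success, -1 on failure. */
int cpu_times(double *user_sec, double *sys_sec)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user_sec = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    *sys_sec  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return 0;
}
```

Call it once before and once after the code being measured and subtract. The values are still only as fine-grained as the kernel's accounting, so very short snippets may show a difference of zero.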

Also check with gettimeofday to compare against wall-clock time.
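For the wall-clock side of that comparison, something along these lines might do; wall_seconds is a hypothetical name chosen for this sketch.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: current wall-clock time in seconds since the Epoch,
 * with microsecond resolution, or -1.0 on failure. */
double wall_seconds(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        return -1.0;
    return tv.tv_sec + tv.tv_usec / 1e6;
}
```

If the wall-clock difference around the code is clearly non-zero while the CPU-time difference is zero, the CPU-time counter is probably too coarse for the snippet. Wall time also includes time spent waiting (for example on a database), so the two are not expected to match.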

I would try to correlate the time with the shell's time command, as a sanity check.

You should also consider that the compiler may be optimizing the loop. Since the memset does not depend on the loop variable, the compiler will certainly be tempted to apply an optimization known as loop-invariant code motion.
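One way to make the benchmark loop harder to optimize away, sketched here as an assumption rather than a guaranteed fix, is to call memset through a volatile function pointer. The compiler then has to reload the pointer and perform the call on every iteration. The names opaque_memset and fill_repeatedly are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Because the pointer is volatile, the compiler must reload it for every
 * call and cannot assume it still points at memset. */
static void *(*volatile opaque_memset)(void *, int, size_t) = memset;

/* Hypothetical benchmark body: clears buf the requested number of times. */
void fill_repeatedly(char *buf, size_t len, int reps)
{
    for (int i = 0; i < reps; i++)
        opaque_memset(buf, 0, len);   /* indirect call, not hoisted or merged */
}
```

Checking the generated assembly is still the only way to be sure what the compiler actually emitted.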

I would also caution that a 10MB possibly in-cache clear will really be 1.25 or 2.5 million CPU operations, as memset certainly writes in 4-byte or 8-byte quantities. While I rather doubt that this could be done in less than a microsecond, as stores are a bit expensive and 100K adds some L1 cache pressure, you are talking about not much more than one operation per nanosecond, which is not that hard to sustain for a multi-GHz CPU.

One imagines that 600 ns would round off to 1 clock tick, but I would worry about that as well.

You can use clock(), which returns a clock_t, to get the number of CPU clock ticks used since the program started.

Or you can use the Linux time command, e.g.: time [program] [arguments]

There is some info on HP's page about high-resolution timers. The same trick, _Asm_mov_from_ar (_AREG_ITC);, is also used in http://www.fftw.org/cycle.h.

I have to confirm whether this can really be the solution.

Sample program, as tested on HP-UX 11.31:

bbb@m_001/tmp/prof > ./perf_ticks 1024
ticks-memset {func [1401.000000] inline [30.000000]} noop [9.000000]
bbb@m_001/tmp/prof > cat perf_ticks.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include "cycle.h" /* one from http://www.fftw.org/cycle.h */
void test_ticks(char* sbuf, int* len){
    memset((char*)sbuf,0,*len);
}
int main(int argc,char* argv[]){
        int len=atoi(argv[1]);
        char *sbuf=(char*)malloc(len);
        ticks t1,t2,t3,t4,t5,t6;
        t1 =getticks(); test_ticks(sbuf,&len); t2 =getticks();
        t3 =getticks(); memset((char*)sbuf,0,len); t4 =getticks();
        t5=getticks(); t6=getticks();
        /* elapsed() returns a double, so print it with %f */
        printf("ticks-memset {func [%f] inline [%f]} noop [%f]\n",
                          elapsed(t2,t1),elapsed(t4,t3),elapsed(t6,t5));
        free(sbuf); return 0;
}
bbb@m_001/tmp/prof >

Resource usage of a process/thread is updated by the OS only periodically. It's entirely possible for a code snippet to complete before the next update, thus producing zero resource-usage diffs. I can't say anything about HP-UX or AIX; for Sun, I would refer you to the Solaris Performance and Tools book. For Linux you want to look at oprofile and the newer perf tool. On the profiling side, valgrind would be of much help.
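Given that periodic accounting, a common workaround is to repeat the snippet until the measured CPU time is comfortably larger than one accounting interval and then divide by the repetition count. The sketch below is one possible way to do that with clock(); the names cpu_seconds_per_call, struct buffer and clear_buffer are all assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical workload description, used only by the example below. */
struct buffer { char *data; size_t len; };

/* Hypothetical workload: clears the buffer once. */
void clear_buffer(void *arg)
{
    struct buffer *b = arg;
    memset(b->data, 0, b->len);
}

/* Runs work(arg) in batches, doubling the batch size until one batch uses at
 * least min_seconds of CPU time, then returns the average CPU seconds per
 * call for that batch. */
double cpu_seconds_per_call(void (*work)(void *), void *arg, double min_seconds)
{
    /* A volatile pointer keeps the compiler from inlining and then removing
     * the calls. */
    void (*volatile call)(void *) = work;
    long reps = 1;

    for (;;) {
        clock_t start = clock();
        for (long i = 0; i < reps; i++)
            call(arg);
        clock_t end = clock();

        double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
        if (elapsed >= min_seconds || reps > LONG_MAX / 2)
            return elapsed / reps;
        reps *= 2;   /* too short to measure reliably, so try a larger batch */
    }
}
```

For example, cpu_seconds_per_call(clear_buffer, &b, 0.5) would keep doubling the batch until it consumes about half a second of CPU time. The threshold is a judgment call: it only needs to be many times larger than the platform's tick length.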
