简体   繁体   中英

What is the most reliable way to measure the number of cycles of my program in C?

I am familiar with two approaches, but both of them have their limitations.

The first one is to use the instruction RDTSC . However, the problem is that it doesn't count the number of cycles of my program in isolation and is therefore sensitive to noise due to concurrent processes.

The second option is to use the clock library function. I thought that this approach is reliable, since I expected it to count the number of cycles for my program only (what I intend to achieve). However, it turns out that in my case it measures the elapsed time and then multiplies it by CLOCKS_PER_SEC . This is not only unreliable, but also wrong, since CLOCKS_PER_SEC is set to 1,000,000 which does not correspond to the actual frequency of my processor.

Given the limitation of the proposed approaches, is there a better and more reliable alternative to produce consistent results?

A lot here depends on how large an amount of time you're trying to measure.

RDTSC can be (almost) 100% reliable when used correctly. It is, however, of use primarily for measuring truly microscopic pieces of code. If you want to measure two sequences of, say, a few dozen or so instructions apiece, there's probably nothing else that can do the job nearly as well.

Using it correctly is somewhat challenging though. Generally speaking, to get good measurements you want to do at least the following:

  1. Set the code to only run on one specific core.
  2. Set the code to execute at maximum priority so nothing preempts it.
  3. Use CPUID liberally to ensure serialization where needed.

If, on the other hand, you're trying to measure something that takes anywhere from, say, 100 ms on up, RDTSC is pointless. It's like trying to measure the distance between cities with a micrometer. For this, it's generally best to assure that the code in question takes (at least) the better part of a second or so. clock isn't particularly precise, but for a length of time on this general order, the fact that it might only be accurate to, say, 10 ms or so, is more or less irrelevant.

RDTSC is the most accurate way of counting program execution cycles. If you are looking to measure execution performance over time scales where it matters if your thread has been preempted, then you would probably be better served with a profiler (VTune, for instance).

CLOCKS_PER_SECOND/clock() is pretty much a very bad (low performance) way of getting time as compared to RDTSC which has almost no overhead.

If you have a specific issue with RDTSC, I may be able to assist.


re: Comments

Intel Performance Counter Monitor : This is mainly for measuring metrics outside of the processor, such as Memory bandwidth, power usage, PCIe utilization. It does also happen to measure CPU frequency, but it typically is not useful for processor bound application performance.

RDTSC portability: RDTSC is an intel CPU instruction supported by all modern Intel CPU's. On modern CPU's it is based on the uncore frequency of your CPU and somewhat similar across CPU cores, although it is not appropriate if your application is frequently being preempted to different cores (and especially to different sockets). If that is the case you really want to look at a profiler.

Out of order Execution : Yes, things get executed out of order, so this can affect performance slightly, but it still takes time to execute instructions and RDTSC is the best way of measuring that time. It excels in the normal use case of executing Non-IO bound instructions on the same core, and this is really how it is meant to be used. If you have a more complicated use case you really should be using a different tool, but that doesn't negate that rdtsc() can be very useful in analyzing program execution.

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES

This system call has explicit controls for:

  • process PID selection
  • whether to consider kernel/hypervisor instructions or not

and it will therefore count the cycles properly even when multiple processes are running concurrently.

See this answer for more details: How to get the CPU cycle count in x86_64 from C++?

perf_event_open.c

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("%lld\n", count);

    close(fd);
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM