How to Configure and Sample Intel Performance Counters In-Process

Question

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):

results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
    pctr_start = sample_pctr();
    the_benchmark();
    pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}

FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (in an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).

My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().

I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.

Does anyone know a lightweight way to ask the kernel, from user space, to configure a performance counter so that I can then use RDPMC from within my benchmark harness?

Stuff I've looked into/thought about:

Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process at specific points (before and after each iteration).
Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error-prone.

Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.

Any help is most appreciated. Thanks.

EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.

Answer 1

It seems the best way -- for Linux at least -- is to use the msr device node.

You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
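As a rough illustration of the open/seek/read-or-write steps, here is a minimal sketch in C. The helper names (msr_open, msr_read, msr_write) are made up for this example and are not part of any library. It uses pread/pwrite, which combine the seek and the 8-byte transfer into one call. It assumes the Linux msr driver is loaded and that the process has enough privilege to open the node (typically root).

```c
/* Ask for POSIX.1-2008 declarations (pread/pwrite) if nothing was set yet. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open the msr device node of one logical CPU,
 * e.g. "/dev/cpu/0/msr". Returns a file descriptor, or -1 on error. */
int msr_open(const char *path)
{
    return open(path, O_RDWR);
}

/* Hypothetical helper: read the 64-bit MSR whose address is `reg`.
 * The file offset is the MSR address. Returns 0 on success, -1 on failure. */
int msr_read(int fd, uint32_t reg, uint64_t *value)
{
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Hypothetical helper: write the 64-bit MSR whose address is `reg`.
 * Returns 0 on success, -1 on failure. */
int msr_write(int fd, uint32_t reg, uint64_t value)
{
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)reg);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

A caller would do something like fd = msr_open("/dev/cpu/0/msr"), then msr_write(fd, reg, bits) to program a counter and msr_read(fd, reg, &value) to sample it, where reg and bits come from the Intel manual for the event in question. Each device node addresses one logical CPU, so the benchmark should be pinned to the matching CPU.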

OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.
