
How to measure instruction execution time (in clock cycles) instead of seconds in C++?

I'm not sure whether my function calcClock() is correct, or whether its name is appropriate. My question is: what is the exact formula for measuring execution time in clock cycles? I was reading http://www0.cs.ucl.ac.uk/teaching/B261/Slides/lecture2/tsld015.htm but I don't understand it, because it has no example and is hard to follow.

I ask because I will be measuring the execution time of many functions, such as _multiply(), and once the measurements are in place they must not be changed anymore.

Edit after answers: I renamed calcClock to calcClockCycles, and totalPerformedInstructions to totalPerformedExpressions, because an expression can consist of multiple instructions.

#include <chrono>
struct Chrono {
    // Referenced from:
    // - https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now
    // - https://levelup.gitconnected.com/8-ways-to-measure-execution-time-in-c-c-48634458d0f9

private:
    std::chrono::high_resolution_clock::time_point _start, _end;

public:
    void start() {
        _start = std::chrono::high_resolution_clock::now();
    }
    void end() {
        _end = std::chrono::high_resolution_clock::now();
    }
    double elapsed() {
        // Elapsed time between start() and end(), in seconds.
        std::chrono::duration<double> diff = _end - _start;
        return diff.count();
    }
    double calcClockCycles(int totalPerformedExpressions, float GHz) { // I set GHz to 2.4 with "Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz (4 CPUs), ~2.4GHz".
        return elapsed() / totalPerformedExpressions * GHz*1000*1000*1000;
    }
};
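
The formula that calcClockCycles applies can be written out on its own: cycles per expression = elapsed seconds * clock frequency in Hz / number of expressions. The sketch below is only that arithmetic. The name cyclesPerExpression is hypothetical, and the result is meaningful only under the assumption that the core ran at one fixed frequency for the whole measurement.

```cpp
#include <cassert>

// Hypothetical helper (not part of the question's code).
// Average cycles per expression, ASSUMING the core ran at a constant
// frequency of `ghz` GHz for the whole measured interval.
//   cycles = seconds * (cycles per second) / expressions
double cyclesPerExpression(double elapsedSeconds, long long expressions, double ghz) {
    assert(expressions > 0);
    const double hz = ghz * 1e9;  // GHz -> Hz
    return elapsedSeconds * hz / static_cast<double>(expressions);
}
```

For example, 0.5 seconds for 200 million expressions at an assumed 2.4 GHz works out to about 6 cycles per expression.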

Example of application in main.cpp:

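#include <cstdio>  // printf
// vec2, mat2x2, v4sf, v4si, to_string() and _multiply_slow() are not shown in the question.

Chrono g_ch;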
int g_iterations = 2*1000*1000;
float g_GHz = 2.4f;

#define ITERx100_EXPRESSIONS(X) \
    for (int i = 0; i < g_iterations; i++) { \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
    }

inline vec2 _multiply(const vec2 &_v, const mat2x2 &_M) {
    // |6|T   |2 3|T   |6*2+4*3|   |24|
    // |4|  * |7 5|  = |6*7+4*5| = |62|
    
    #if defined(__GNUC__)
        // M = {[0],[1],
        //      [2],[3]}
        v4sf o;
        v4sf &v = *(v4sf *)&_v;
        v4sf &M = *(v4sf *)&_M;
        #if 0
            o[0] = v[0]*M[0] + v[1]*M[1];
            o[1] = v[0]*M[2] + v[1]*M[3];
        #elif 1
            // v4sf a = __builtin_shuffle(v, v4si{0,1,0,1}) * M;
            // o[0] = a[0] + a[1];
            // o[1] = a[2] + a[3];
            //
            // v4sf a = __builtin_shuffle(v, v4si{0,1,0,1}) * M;
            // o = __builtin_shuffle(a, v4si{0,2}) + __builtin_shuffle(a, v4si{1,3});
            
            v4sf a = __builtin_shuffle(v, v4si{0,0,1,1}) * __builtin_shuffle(M, v4si{0,2,1,3});
            o = a + __builtin_shuffle(a, v4si{2,3});
        #endif
        return *(vec2 *)&o;
    #else
        return _multiply_slow(_v, _M);
    #endif
}

void mat2x2_vxM() {
    mat2x2 M = v4sf{
        2,3,
        7,5,
    };
    vec2 v(6,4);
    vec2 V;

    g_ch.start();
    ITERx100_EXPRESSIONS(V = _multiply(v, M));
    g_ch.end();
    printf("%s: %s, %g\n", __func__, to_string(V).c_str(), g_ch.calcClockCycles(100 * g_iterations, g_GHz));
}

int main() {
    mat2x2_vxM();
    return 0;
}

Example of measurements, where "v1 = v2;" comes out at 1 clock cycle. I'm not sure that is right, and nobody has told me it is exact: the statement compiles to 2 instructions (movaps and another movaps). [image]

Edit: There are no optimizations; it's a [Debug] build. It's almost impossible for me to build an estimated measurement calculation, because I'm building a programming language that needs to estimate the clock cycles of an expression. [image]

What I think you want is not an easy task at all.

  • There is no built-in support for counting clock cycles in C++.
  • We are long past the time=cycles*frequency era of CPUs, especially with something like Intel's i7.
  • If you really want to measure how many clocks a sequence of instructions takes, you cannot do it from an ordinary user-space program, because you are at the mercy of the scheduler and the many interrupts running there.
  • Cache will have a huge impact on any memory loads/stores, so the context in which the function is called matters a lot.
  • Simply running a function in a for loop and averaging the runtime is not guaranteed to work at all. First, the cache can really skew the results compared to a real benchmark. On the other hand, hot paths are likely cached in the benchmark too. Second, you had better ensure that ITERx100_EXPRESSIONS is not optimized away by the compiler, because as written it absolutely will be if the compiler can prove the repeated X are useless. The compiler sees inside _multiply, which touches no globals and takes its arguments by const reference, so it is effectively pure. That makes it a prime candidate for throwing away not only the repeated X but the loop itself too.
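
One common way to keep the compiler from discarding a benchmarked computation is to make its result "escape" through an empty inline-asm statement. The sketch below assumes GCC or Clang for the asm form and falls back to a volatile pointer store elsewhere. keepAlive and sumLoop are hypothetical names, not part of the question's code, and this stops the result from being thrown away without making the timing itself trustworthy.

```cpp
#include <cassert>

// Hypothetical helper: tells the compiler that `value` may be read by
// something it cannot see, so the computation producing it must happen.
template <class T>
inline void keepAlive(const T& value) {
#if defined(__GNUC__) || defined(__clang__)
    // Empty asm that takes the address as input and clobbers memory.
    asm volatile("" : : "g"(&value) : "memory");
#else
    // Fallback: store the address through a volatile pointer.
    static const void* volatile sink;
    sink = &value;
#endif
}

// Example workload: each iteration's result is kept alive, so the loop
// cannot be folded away as dead code.
int sumLoop(int iterations) {
    int acc = 0;
    for (int i = 0; i < iterations; ++i) {
        acc += i & 1;    // adds 1 for every odd i
        keepAlive(acc);  // acc must be materialized here
    }
    return acc;
}
```

In the question's macro, the equivalent would be to call such a helper on V after each X, at the cost of a little extra work per iteration.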

I see no problems* with Chrono itself, but for the reasons stated above calcClock is not really meaningful. My advice would be to focus on program design, correctness, and proper encapsulation. Leave performance to the compiler.

*Maybe add a compile-time check for std::chrono::high_resolution_clock::is_steady so you are not surprised later.
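
Such a check could look roughly like the following sketch. BenchClock and elapsedSeconds are hypothetical names. Rather than failing the build when high_resolution_clock is not steady, this variant falls back to std::chrono::steady_clock, which is one possible design choice among several.

```cpp
#include <cassert>
#include <chrono>
#include <type_traits>

// Use high_resolution_clock only if it is steady (monotonic);
// otherwise fall back to steady_clock.
using BenchClock = std::conditional_t<
    std::chrono::high_resolution_clock::is_steady,
    std::chrono::high_resolution_clock,
    std::chrono::steady_clock>;

static_assert(BenchClock::is_steady, "benchmark clock must be steady");

// Seconds between two readings taken in order from BenchClock.
double elapsedSeconds(BenchClock::time_point start, BenchClock::time_point end) {
    assert(end >= start);  // a steady clock never goes backwards
    return std::chrono::duration<double>(end - start).count();
}
```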

Then you should construct a real-world benchmark (or as close to one as possible). Only after that can you play with changing the implementation and seeing how it impacts the benchmark. Those will be the most important measurements to care about. You should then look at the disassembly and try to explain and learn from those numbers.
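
As a rough illustration of comparing implementations with a benchmark, the sketch below times a workload several times and reports the median, which is less sensitive to a single disturbed run than a plain average. medianSeconds is a hypothetical helper, and the workload passed to it is assumed to be representative of real use.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` `runs` times and returns the median
// wall-clock time of one run, in seconds.
template <class F>
double medianSeconds(F&& work, int runs) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r < runs; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // middle element (upper median)
}
```

Two implementations can then be compared by passing each one as `work` with the same inputs and comparing the two medians.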

I guess you could always look at the disassembly first, refer to instruction latencies and throughputs, and calculate a guesstimate from that. They should be found in the reference manual for the CPU, but that calculation is likely far from trivial too.
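
A very crude version of such a guesstimate is sketched below. It brackets a straight-line instruction sequence between two simplified cases: every instruction independent (sum of reciprocal throughputs) and every instruction depending on the previous one (sum of latencies). InstrCost, CycleEstimate and estimateCycles are hypothetical names, and the model ignores caches, port contention, branch prediction and the front end. Real per-instruction numbers would have to come from the CPU vendor's documentation.

```cpp
#include <cassert>
#include <vector>

// Per-instruction figures, to be filled in from the CPU's documentation.
struct InstrCost {
    double latency;               // cycles until the result is usable
    double reciprocalThroughput;  // cycles per instruction when independent
};

struct CycleEstimate {
    double independent;     // all instructions independent of each other
    double dependentChain;  // each instruction waits for the previous one
};

// Crude bracket for a straight-line sequence; not a simulation.
CycleEstimate estimateCycles(const std::vector<InstrCost>& seq) {
    CycleEstimate e{0.0, 0.0};
    for (const InstrCost& c : seq) {
        assert(c.latency >= 0.0 && c.reciprocalThroughput >= 0.0);
        e.independent += c.reciprocalThroughput;
        e.dependentChain += c.latency;
    }
    return e;
}
```

The wider the gap between the two figures, the more the real cost depends on how the instructions depend on each other, which is one reason a single "cycles per expression" number is hard to pin down.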
