How to measure instruction execution time in clock cycles instead of seconds in C++?
My question has two parts: I'm not sure whether my function calcClock() is correct (or whether its name is right), and I'd like to know the exact formula for measuring execution time in clock cycles. I was reading http://www0.cs.ucl.ac.uk/teaching/B261/Slides/lecture2/tsld015.htm, but I don't understand it, because it has no example and is hard to follow.
The reason I ask is that I will measure the execution time of many functions, such as _multiply(), and once the measurements are in place they must not change anymore.
Edit after answers: I renamed calcClock to calcClockCycles, and totalPerformedInstructions to totalPerformedExpressions, because an expression can consist of multiple instructions.
#include <chrono>
struct Chrono {
    // Referenced from:
    // - https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now
    // - https://levelup.gitconnected.com/8-ways-to-measure-execution-time-in-c-c-48634458d0f9
private:
    std::chrono::high_resolution_clock::time_point _start, _end;
public:
    void start() {
        _start = std::chrono::high_resolution_clock::now();
    }
    void end() {
        _end = std::chrono::high_resolution_clock::now();
    }
    // Time between start() and end(), in seconds.
    double elapsed() {
        std::chrono::duration<double> diff = _end - _start;
        return diff.count();
    }
    // Average clock cycles per expression: seconds per expression times clock frequency in Hz.
    double calcClockCycles(int totalPerformedExpressions, float GHz) { // I set GHz to 2.4 with "Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz (4 CPUs), ~2.4GHz".
        return elapsed() / totalPerformedExpressions * GHz*1000*1000*1000;
    }
};
Example of application in main.cpp:
Chrono g_ch;
int g_iterations = 2*1000*1000;
float g_GHz = 2.4f;

#define ITERx100_EXPRESSIONS(X) \
    for (int i = 0; i < g_iterations; i++) { \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
        X; X; X; X; X; X; X; X; X; X; \
    }

inline vec2 _multiply(const vec2 &_v, const mat2x2 &_M) {
    // |6|T |2 3|T |6*2+4*3| |24|
    // |4| * |7 5| = |6*7+4*5| = |62|
#if defined(__GNUC__)
    // M = {[0],[1],
    //      [2],[3]}
    v4sf o;
    v4sf &v = *(v4sf *)&_v;
    v4sf &M = *(v4sf *)&_M;
#if 0
    o[0] = v[0]*M[0] + v[1]*M[1];
    o[1] = v[0]*M[2] + v[1]*M[3];
#elif 1
    // v4sf a = __builtin_shuffle(v, v4si{0,1,0,1}) * M;
    // o[0] = a[0] + a[1];
    // o[1] = a[2] + a[3];
    //
    // v4sf a = __builtin_shuffle(v, v4si{0,1,0,1}) * M;
    // o = __builtin_shuffle(a, v4si{0,2}) + __builtin_shuffle(a, v4si{1,3});
    v4sf a = __builtin_shuffle(v, v4si{0,0,1,1}) * __builtin_shuffle(M, v4si{0,2,1,3});
    o = a + __builtin_shuffle(a, v4si{2,3});
#endif
    return *(vec2 *)&o;
#else
    return _multiply_slow(_v, _M);
#endif
}

void mat2x2_vxM() {
    mat2x2 M = v4sf{
        2,3,
        7,5,
    };
    vec2 v(6,4);
    vec2 V;
    g_ch.start();
    ITERx100_EXPRESSIONS(V = _multiply(v, M));
    g_ch.end();
    printf("%s: %s, %g\n", __func__, to_string(V).c_str(), g_ch.calcClockCycles(100 * g_iterations, g_GHz));
}

int main() {
    mat2x2_vxM();
    return 0;
}
Example of a measurement: "v1 = v2;" comes out at 1 clock cycle, but I'm not sure that is right. Nobody has told me it is exact, and the statement compiles to 2 instructions (movaps and another movaps).
Edit: There are no optimizations; it's a [Debug] build. It's almost impossible for me to build an estimation calculation, because I'm building a programming language that needs to estimate the clock cycles of an expression.
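For reference, the arithmetic that calcClockCycles performs can be written as a standalone function. This is only a sketch: the name cyclesPerExpression and its parameters are hypothetical, and it assumes the core runs at one constant frequency for the whole measurement, which turbo boost and frequency scaling do not guarantee.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of the original code).
// Average clock cycles per expression, ASSUMING a constant core frequency.
//   elapsedSeconds  : wall-clock time for the whole run
//   expressionCount : number of expressions executed during that time
//   frequencyGHz    : assumed clock frequency, e.g. 2.4 for a 2.40 GHz part
double cyclesPerExpression(double elapsedSeconds, long long expressionCount, double frequencyGHz) {
    assert(expressionCount > 0);
    // cycles = seconds * (cycles per second)
    const double totalCycles = elapsedSeconds * frequencyGHz * 1e9;
    return totalCycles / static_cast<double>(expressionCount);
}
```

With these assumptions, a run of 200,000,000 expressions that takes 0.5 s on a 2.4 GHz core averages 6 cycles per expression. The result is an average over the whole loop, including loop overhead, not the latency of any single instruction.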
What I think you want is not an easy task at all.
The era of CPUs where time = cycles / frequency held exactly is long gone, especially with something like Intel's i7.
Make sure ITERx100_EXPRESSIONS is not optimized away by the compiler, because as it is written it absolutely will be if the compiler can prove that the repeated X are useless. The compiler can see inside _multiply: it touches no globals and takes its arguments by const ref, which makes it pure, so it is a prime candidate for throwing away not only the repeated X but the loop itself too.
I see no problems* with Chrono itself, but for the reasons stated above calcClock is not really meaningful. My advice would be to focus on program design, correctness, and proper encapsulation. Leave performance to the compiler.
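If you do keep a micro-benchmark, the concern above about the compiler discarding the loop can be addressed by making the inputs and results observable. The following is a minimal sketch, not the asker's code: multiplyAdd and timeLoop are hypothetical names, and volatile is used as a portable (if blunt) way to force one load and one store per iteration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the function under test.
inline float multiplyAdd(float a, float b, float c) { return a * b + c; }

// Times `iterations` calls and returns the elapsed time in seconds.
// The input is read through a volatile object, so the compiler must reload it
// every iteration and cannot hoist the call out of the loop. The result is
// written to a volatile sink, so every store must actually be performed.
double timeLoop(int iterations) {
    volatile float input = 6.0f;
    volatile float sink = 0.0f;
    using Clock = std::chrono::steady_clock;

    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = multiplyAdd(input, 2.0f, 4.0f);
    }
    const auto end = Clock::now();

    assert(sink == 16.0f); // 6 * 2 + 4
    return std::chrono::duration<double>(end - start).count();
}
```

The volatile load and store are themselves part of what gets timed, so the number is an upper bound on the cost of the call rather than an exact cycle count.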
*Maybe add a compile-time check for std::chrono::high_resolution_clock::is_steady so you are not surprised later.
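One way to express that compile-time check is sketched below. Whether high_resolution_clock is steady is implementation-defined, so instead of failing the build this sketch falls back to std::chrono::steady_clock. The names BenchClock and secondsSince are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <type_traits>

// Use high_resolution_clock only when it is steady (monotonic);
// otherwise fall back to steady_clock.
using BenchClock = std::conditional_t<
    std::chrono::high_resolution_clock::is_steady,
    std::chrono::high_resolution_clock,
    std::chrono::steady_clock>;

static_assert(BenchClock::is_steady, "benchmark clock must be monotonic");

// Seconds elapsed since `start`, measured on the chosen clock.
inline double secondsSince(BenchClock::time_point start) {
    return std::chrono::duration<double>(BenchClock::now() - start).count();
}
```

A steady clock never goes backwards, so an elapsed time computed this way cannot come out negative if the system time is adjusted during the measurement.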
Then you should construct a real-world benchmark (or as close to one as possible). Only after that can you play with changing the implementation and seeing how it impacts the benchmark. Those will be the most important measurements you should care about. You should then look at the disassembly and try to explain and learn from those numbers.
I guess you could always look at the disassembly first, refer to instruction latencies and throughputs, and calculate a guesstimate from that. They should be found in the reference manual for the CPU, but that calculation is likely really non-trivial too.