
Calculating FLOPs

I am writing a program to measure how long my CPU takes to do one FLOP (floating-point operation). For that I wrote the code below:

#include <stdio.h>
#include <time.h>

#define MAX 1000000

int main(void) {
    clock_t before = clock();
    double y = 4.8, x = 2.3, z = 0;
    for (int i = 0; i < MAX; ++i) {
        z = x * y + z;
    }
    printf("%1.20f\n", ((double)(clock() - before) / CLOCKS_PER_SEC) / MAX);
    return 0;
}

The problem is that I am repeating the same operation. Doesn't the compiler optimize this sort of thing away? If so, what do I have to do to get correct results?

I am not using the rand function, so it does not interfere with my result.

This has a loop-carried dependency and not enough work to do in parallel, so if anything is executed at all, it would not be FLOPS that you're measuring; with this you would probably measure the latency of floating-point addition. The loop-carried dependency chain serializes all those additions. That chain has some small side-chains with multiplications in them, but they don't depend on anything, so only their throughput matters. And that throughput is going to be better than the latency of an addition on any reasonable processor.

To actually measure FLOPS, there is no single recipe. The optimal conditions depend strongly on the microarchitecture: the number of independent dependency chains you need, the optimal add/multiply ratio, whether you should use FMA, it all depends. Typically you have to do something more complicated than what you wrote, and if you are set on using a high-level language, you have to somehow trick it into actually doing anything at all.

For inspiration, see: How do I achieve the theoretical maximum of 4 FLOPs per cycle?

Even if no compiler optimization is going on (the possibilities have already been nicely listed), your variables and the result will be in cache after the first loop iteration, and from then on the loop runs with far more speed than it would if the program had to fetch new values for each iteration.

So if you want to measure the time for a single flop for a single iteration of this program, you would actually have to supply new input for every iteration. Really consider using rand(), and just seed with a known value, e.g. srand(1).

Your calculation should also be different: flops here means the number of floating-point operations your program performs, which in your case is 2*n (where n = MAX), since each iteration does one multiplication and one addition. To get the time per flop, divide the time used by the number of flops.
