Measure time for popcount function in C++
I am interested in how to put this in a loop so that I get the real time the CPU takes to execute each of the different operations.
#include<iostream>
#include<cstdlib>
#include<time.h>
using namespace std;
typedef unsigned __int64 uint64;
const uint64 m1=0x5555555555555555;
const uint64 m2=0x3333333333333333;
const uint64 m4=0x0f0f0f0f0f0f0f0f;
const uint64 m8=0x00ff00ff00ff00ff;
const uint64 m16=0x0000ffff0000ffff;
const uint64 m32=0x00000000ffffffff;
const uint64 hff=0xffffffffffffffff;
const uint64 h01=0x0101010101010101;
uint64 popcount_1(uint64 x)
{
x=(x&m1)+((x>>1)&m1);
x=(x&m2)+((x>>2)&m2);
x=(x&m4)+((x>>4)&m4);
x=(x&m8)+((x>>8)&m8);
x=(x&m16)+((x>>16)&m16);
x=(x&m32)+((x>>32)&m32);
return (uint64)x;
}
//This uses fewer arithmetic operations than any other known
//implementation on machines with slow multiplication.
//It uses 17 arithmetic operations.
int popcount_2(uint64 x)
{
x-=(x>>1)&m1;//put count of each 2 bits into those 2 bits
x=(x&m2)+((x>>2)&m2);//put count of each 4 bits into those 4 bits
x=(x+(x>>4))&m4; //put count of each 8 bits into those 8 bits
x+=x>>8;//put count of each 16 bits into their lowest 8 bits
x+=x>>16;
x+=x>>32;
return x&0x7f;
}
////This uses fewer arithmetic operations than any other known
//implementation on machines with fast multiplication.
//It uses 12 arithmetic operations, one of which is a multiply.
int popcount_3(uint64 x)
{
x-=(x>>1)&m1;
x=(x&m2)+((x>>2)&m2);
x=(x+(x>>4))&m4;
return (x*h01)>>56;
}
uint64 popcount_4(uint64 x)
{
uint64 count;
for(count=0; x; count++)
{
x&=x-1;
}
return count;
}
uint64 random()
{
uint64 r30=RAND_MAX*rand()+rand();
uint64 s30=RAND_MAX*rand()+rand();
uint64 t4=rand()&0xf;
uint64 res=(r30<<34 )+(s30<<4)+t4;
return res;
}
int main()
{
int testnum;
while (true)
{
cout<<"enter number of test "<<endl;
cin>>testnum;
uint64 x= random();
switch(testnum)
{
case 1: {
clock_t start=clock();
popcount_1(x);
clock_t end=clock();
cout<<"execution time of first method"<<start-end<<" "<<endl;
}
break;
case 2: {
clock_t start=clock();
popcount_2(x);
clock_t end=clock();
cout<<"execution time of first method"<<start-end<<" "<<endl;
}
break;
case 3: {
clock_t start=clock();
popcount_3(x);
clock_t end=clock();
cout<<"execution time of first method"<<start-end<<" "<<endl;
}
break;
case 4: {
clock_t start=clock();
popcount_4(x);
clock_t end=clock();
cout<<"execution time of first method"<<start-end<<" "<<endl;
}
break;
default:
cout<<"it is not correct number "<<endl;
break;
}
}
return 0;
}
It always writes zero on the terminal, regardless of which test number I enter. It is clear to me why: 10, or even 20 or 100, operations are nothing for a modern computer. But how could I change it so that I get, if not an exact answer, at least an approximation? Thanks a lot.
Just repeat all the tests a large number of times.
The following repeats each test roughly 2^26 times (the loop bound is 1ul<<26). You might want to divide the total by the number of iterations to get a per-call time in nanoseconds, but for comparison the total numbers would be fine (and easier to read).
int main()
{
int testnum;
while (true)
{
cout<<"enter number of test "<<endl;
cin>>testnum;
uint64 x= random();
clock_t start=clock();
switch(testnum)
{
case 1: for(unsigned long it=0; it<=(1ul<<26); ++it) popcount_1(x); break;
case 2: for(unsigned long it=0; it<=(1ul<<26); ++it) popcount_2(x); break;
case 3: for(unsigned long it=0; it<=(1ul<<26); ++it) popcount_3(x); break;
case 4: for(unsigned long it=0; it<=(1ul<<26); ++it) popcount_4(x); break;
default:
cout<<"it is not correct number "<<endl;
break;
}
clock_t end=clock();
cout<<"execution time of method " << testnum << ": " << (end-start) <<" "<<endl;
}
return 0;
}
Note that I also fixed start-end to (end-start) :)
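The conversion from total clock() ticks to a per-call time mentioned above can be sketched as follows. This is only a minimal illustration, not the answerer's code: ns_per_call and time_popcount_3 are hypothetical helper names, popcount_3 is restated from the question using std::uint64_t instead of unsigned __int64, and the volatile sink is an added assumption to keep the compiler from discarding the loop.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Multiply-based popcount from the question (popcount_3),
// restated with std::uint64_t for portability.
inline int popcount_3(std::uint64_t x)
{
    const std::uint64_t m1  = 0x5555555555555555ULL;
    const std::uint64_t m2  = 0x3333333333333333ULL;
    const std::uint64_t m4  = 0x0f0f0f0f0f0f0f0fULL;
    const std::uint64_t h01 = 0x0101010101010101ULL;
    x -= (x >> 1) & m1;
    x = (x & m2) + ((x >> 2) & m2);
    x = (x + (x >> 4)) & m4;
    return static_cast<int>((x * h01) >> 56);
}

// Hypothetical helper: converts a clock() tick difference into
// nanoseconds per call.
inline double ns_per_call(std::clock_t ticks, std::uint64_t calls)
{
    assert(calls > 0);
    double seconds = static_cast<double>(ticks) / CLOCKS_PER_SEC;
    return seconds * 1e9 / static_cast<double>(calls);
}

// Hypothetical helper: times `calls` invocations of popcount_3 with clock().
inline double time_popcount_3(std::uint64_t calls)
{
    std::clock_t start = std::clock();
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < calls; ++i)
        sum += popcount_3(i);              // result depends on every iteration
    volatile std::uint64_t sink = sum;     // volatile store keeps the loop alive
    std::clock_t end = std::clock();
    (void)sink;
    return ns_per_call(end - start, calls);
}
```

Calling time_popcount_3(1u << 26) from main and printing the result would give an approximate per-call cost. With small iteration counts the tick difference is likely to be zero again, which is the effect described in the question.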
You want to perform a microbenchmark of a very cheap operation. You need to return a value that depends on all loop iterations and actually use that return value in your main program (e.g. print it or store it in a volatile variable), to ensure the compiler can't just optimize the program and remove it.

In addition, you should use a high-resolution timer rather than clock(). On Windows this would be QueryPerformanceCounter(&tick_count), on Unix clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &timespec_var), and on macOS have a look at mach_absolute_time(). Another advantage of (some of) these methods is that they measure CPU time, not wall-clock time, and are thus slightly less variable in the face of other activity on the system.

Again, it's absolutely critical to make sure you actually use the values computed, either by storing them in a volatile variable, by printing them, or by returning them from a non-inlined function, so that the compiler can't just optimize them away. And you do not want to mark your core method non-inlinable, since function call overhead may well swamp such microbenchmarks; for similar reasons you should probably avoid random. This is why you should benchmark a function containing a loop calling the (inlinable) function you're actually interested in.
For example:
#include <cstddef>
#include <iostream>
#include <time.h>
typedef unsigned __int64 uint64;
// popcount_1 ... popcount_4 as in the question, each marked inline
inline uint64 popcount_1(uint64 x)// etc...
const size_t RUNS = 1u << 30; // 1073741824 runs, as in the output shown after this listing
template<typename TF>
uint64 bench_intfunc_helper(TF functor, size_t runs){//benchmark this
    uint64 retval = 0;
    for(size_t i=0; i<runs; ++i) retval += functor(i);
    // note that i may not have a representative distribution like this
    return retval;//depends on all loop iterations!
}
template<typename TF>
double bench_intfunc(TF functor, size_t runs){
    clock_t start=clock();//hi-res timers would be better
    // volatile forces the helper's result to be computed and stored
    volatile auto force_evaluation = bench_intfunc_helper(functor,runs);
    clock_t end=clock();
    return (end-start)/(double)CLOCKS_PER_SEC;//seconds on any platform
}
#define BENCH(f) do {std::cout<<"Elapsed time for "<< RUNS <<" runs of " #f \
    ": " << bench_intfunc([](uint64 x) {return f(x);},RUNS) <<"s\n"; } while(0)
int main() {
    BENCH(popcount_1);
    BENCH(popcount_2);
    BENCH(popcount_3);
    BENCH(popcount_4);
    return 0;
}
Simply omitting volatile, for example, causes GCC 4.6.3 and MSC 10.0 on my machine to report 0s spent. I'm using a lambda since function pointers aren't inlined by these compilers, but lambdas are.
On my machine the output of this benchmark on GCC is:
Elapsed time for 1073741824 runs of popcount_1: 3.7s
Elapsed time for 1073741824 runs of popcount_2: 3.822s
Elapsed time for 1073741824 runs of popcount_3: 4.091s
Elapsed time for 1073741824 runs of popcount_4: 23.821s
and on MSC:
Elapsed time for 1073741824 runs of popcount_1: 7.508s
Elapsed time for 1073741824 runs of popcount_2: 5.864s
Elapsed time for 1073741824 runs of popcount_3: 3.705s
Elapsed time for 1073741824 runs of popcount_4: 19.353s
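As a rough illustration of the higher-resolution timer advice, the sketch below swaps clock() for the POSIX call clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) named in the answer. It assumes a POSIX system (it will not build with MSC), and cpu_seconds, BenchResult and bench are hypothetical names introduced here, not part of the original benchmark. popcount_4 is restated from the question with std::uint64_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <time.h>   // POSIX clock_gettime, CLOCK_PROCESS_CPUTIME_ID

// Bit-clearing popcount from the question (popcount_4).
inline std::uint64_t popcount_4(std::uint64_t x)
{
    std::uint64_t count = 0;
    for (; x; ++count)
        x &= x - 1;   // clears the lowest set bit
    return count;
}

// Hypothetical helper: CPU time consumed by this process, in seconds.
inline double cpu_seconds()
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return static_cast<double>(ts.tv_sec)
         + static_cast<double>(ts.tv_nsec) * 1e-9;
}

// Hypothetical result type: elapsed CPU seconds plus a checksum that
// depends on every loop iteration.
struct BenchResult {
    double seconds;
    std::uint64_t checksum;
};

// Times `runs` calls of f(i); f should be a lambda so it can be inlined.
template <typename F>
BenchResult bench(F f, std::size_t runs)
{
    double start = cpu_seconds();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < runs; ++i)
        sum += f(i);
    volatile std::uint64_t sink = sum;   // forces the sum to be computed
    double end = cpu_seconds();
    (void)sink;
    BenchResult r = { end - start, sum };
    return r;
}
```

A caller would pass a lambda, for instance bench([](std::uint64_t x) { return popcount_4(x); }, 1u << 26), and print both fields. Printing the checksum is a second way of making sure the computed values are actually used.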