
What's the difference between __builtin_popcountll and _mm_popcnt_u64?

I was trying to count how many 1 bits there are in 512 MB of memory, and I found two possible methods: _mm_popcnt_u64() and __builtin_popcountll() from the GCC builtins.

_mm_popcnt_u64() is said to use the CPU instruction introduced with SSE4.2, which seems to be the fastest option, and __builtin_popcountll() is expected to use a table lookup.

So I think __builtin_popcountll() should be a little slower than _mm_popcnt_u64().

However, I got a result like this:

[Image: test results]

Both methods took almost the same time. I strongly suspect that they work the same way.

I also found this in popcntintrin.h:

/* Calculate a number of bits set to 1. */
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u32 (unsigned int __X)
{
  return __builtin_popcount (__X);
}

#ifdef __x86_64__
extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u64 (unsigned long long __X)
{
  return __builtin_popcountll (__X);
}
#endif

So I'm confused about how on earth __builtin_popcountll() actually works.

_mm_popcnt_u64 is part of <nmmintrin.h>, a header devised by Intel that provides utility functions for accessing SSE 4.2 instructions.

__builtin_popcountll is a GCC extension.

_mm_popcnt_u64 is portable to non-GNU compilers, and __builtin_popcountll is portable to non-SSE-4.2 CPUs. But on systems where both are available, both should compile to the exact same code.
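As a rough illustration of that point, here is a minimal sketch that counts the set bits in a buffer both ways. The function names (count_bits_builtin, count_bits_intrinsic, check_paths_agree) are made up for this example, and the intrinsic path is guarded by the __POPCNT__ macro, which GCC and Clang define when popcnt is enabled (for instance with -mpopcnt or a suitable -march); without it the second function simply falls back to the builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__POPCNT__) && defined(__x86_64__)
#include <nmmintrin.h>  /* declares _mm_popcnt_u64 when popcnt is enabled */
#endif

/* Count the set bits in n 64-bit words using the GCC/Clang builtin. */
uint64_t count_bits_builtin(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}

/* Same loop, but using the Intel intrinsic when the target supports it.
 * Without popcnt support this falls back to the builtin. */
uint64_t count_bits_intrinsic(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__POPCNT__) && defined(__x86_64__)
        total += (uint64_t)_mm_popcnt_u64(words[i]);
#else
        total += (uint64_t)__builtin_popcountll(words[i]);
#endif
    }
    return total;
}

/* Hypothetical self-check: both paths must report the same count. */
void check_paths_agree(const uint64_t *words, size_t n)
{
    assert(count_bits_builtin(words, n) == count_bits_intrinsic(words, n));
}
```

Compiling this with gcc -O2 -mpopcnt -S and reading the assembly is one way to check whether both loops end up using the same popcnt instruction on a given compiler version.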

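If you compile without a -march flag, that is, with the x86_64 default, the builtin should be slower: GCC cannot assume that the popcnt instruction exists on the target, so instead of the single instruction it emits a call to a generic library routine (__popcountdi2 in libgcc). That prevents inlining and adds call overhead for every word.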
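To give an idea of what a software fallback has to do, here is a sketch of a classic bit-parallel (SWAR) population count. It is only an illustration: the name popcount64_swar is made up for this example, and this is not claimed to be the exact code that libgcc uses.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative software popcount; not the actual libgcc implementation. */
uint64_t popcount64_swar(uint64_t x)
{
    /* Each 2-bit field now holds the number of set bits it contained. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit counts into 8-bit fields. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds all eight byte counts into the top byte. */
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Hypothetical self-check against the compiler builtin. */
void check_swar_matches_builtin(uint64_t x)
{
    assert(popcount64_swar(x) == (uint64_t)__builtin_popcountll(x));
}
```

Even a compact routine like this takes several dependent arithmetic steps per word, which is why a build that can use the hardware instruction is expected to be faster.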
