So for the following sequence: 0001000111000
The desired result would be: 0001000000000
I am fully aware that this is achievable by finding the MSB's index with assembly BSRL (or similar bit-twiddling hack) then >> bit shifting the number by (index - 1), then << shifting back by (index - 1), but I want to know whether there is, specifically, an assembly instruction or a sequence of instructions with better performance, not a bit-twiddling hack that can do this.
There's no single instruction that can do this. BMI1 blsi dst,src
can isolate the lowest set bit, not the highest. ie x & -x
. If x86 had a bit-reversed version of blsi
, we could use that, but it doesn't.
But you can do much better than what you were suggesting . An all-zero input is always going to be a special case for bit-scan and shift. Otherwise our output has exactly 1 bit set. It's 1 << bsr(input)
.
;; input: x in RDI
;; output: result in RAX
isolate_msb:
xor eax, eax ; tmp = 0
bsr rdi, rdi ; edi = bit index of MSB in input
jz .input_was_zero
bts rax, rdi ; rax |= 1<<edi
.input_was_zero: ; return 0 for input=0
ret
Obviously for 32-bit inputs, use only 32-bit registers. And if zero is not possible, omit the JZ. Using BSR instead of LZCNT gives us a bit-index, not 31-bitidx, so we can use it directly. But LZCNT is significantly faster on AMD.
The xor-zeroing is off the critical path, to prepare an input for BTS. xor-zero + BTS is the most efficient way to implement 1<<n
on Intel CPUs. It's 2 uops with 2c latency on AMD, so mov rax,1
/ shl rax,cl
would be better there. But worse on Intel because variable-count shifts are 3 uops, unless you use BMI2 shlx
.
Anyway, the real work here is BSR + BTS, so that's 3 cycle + 1 cycle latency on Intel SnB-family. ( https://agner.org/optimize/ )
unsigned isolate_msb32(unsigned x) {
unsigned bitidx = BSR32(x);
//return 1ULL << bitidx; // if x is definitely non-zero
return x ? 1U << bitidx : x;
}
unsigned isolate_msb64(uint64_t x) {
unsigned bitidx = BSR64(x);
return x ? 1ULL << bitidx : x;
}
Where BSR32
is defined in terms of an intrinsic your compiler supports. This is where things get tricky, especially if you want a 64-bit version. There's no single portable intrinsic. GNU C provides count-leading-zeros intrinsics, but GCC and ICC suck at optimizing 63-__builtin_clzll(x)
back into just BSR. Instead they negate twice. There are builtins for BSR specifically, but those are even more compiler-specific than just MSVC vs. compilers that support GNU extensions (gcc/clang/ICC).
#include <stdint.h>
// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#ifdef __INTEL_COMPILER
typedef unsigned int bsr_idx_t;
#else
#include <intrin.h> // MSVC
typedef unsigned long bsr_idx_t;
#endif
static inline
unsigned BSR32(unsigned long x){
bsr_idx_t idx;
_BitScanReverse(&idx, x); // ignore bool retval
return idx;
}
static inline
unsigned BSR64(uint64_t x) {
bsr_idx_t idx;
_BitScanReverse64(&idx, x); // ignore bool retval
return idx;
}
#elif defined(__GNUC__)
#ifdef __clang__
static inline unsigned BSR64(uint64_t x) {
return 63-__builtin_clzll(x);
// gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
}
#else
#define BSR64 __builtin_ia32_bsrdi
#endif
#include <x86intrin.h>
#define BSR32(x) _bit_scan_reverse(x)
#endif
On the Godbolt compiler explorer , clang and ICC compile this branchlessly, even when they don't know that x
is non-zero.
All 4 compilers fail to use bts
to implement 1<<bit
. :( It's very cheap on Intel.
# clang7.0 -O3 -march=ivybridge (for x86-64 System V)
# with -march=haswell and later it uses lzcnt and has to negate. /sigh.
isolate_msb32(unsigned int):
bsr ecx, edi
mov eax, 1
shl rax, cl
test edi, edi
cmove eax, edi # return 1<<bsr(x) or x (0) if x was zero
ret
GCC and MSVC make branchy code. eg
# gcc8.2 -O3 -march=haswell
mov eax, edi
test edi, edi
je .L6
bsr eax, edi
mov edi, 1
shlx rax, rdi, rax # BMI2: 1 uop instead of 3 for shl rax,cl
.L6:
ret
There is not single instruction for what you ask, no.
But, if you want to avoid twiddling the bits of the variable, there is an alternative approach:
Declare a 2nd variable of the same type as the original variable, and set the 2nd variable to 0. Then loop through the bits of the original variable from highest bit to lowest bit, testing each bit with the &
operator. If you find a bit set to 1, set the corresponding bit in the 2nd variable, then exit the loop. Assign the 2nd variable to the original variable, if needed.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.