
What is the most efficient way to flip all the bits from the least significant bit up to the most significant last 1 bit value?

Say for example I have a uint8_t that can be of any value, and I only want to flip all the bits from the least significant bit up to the most significant last 1 bit value. How would I do that in the most efficient way? Is there a solution where I can avoid using a loop?

Here are some cases:

Left side is the original bits, right side is after the flips.

  • 00011101 -> 00000010
  • 00000000 -> 00000000
  • 11111111 -> 00000000
  • 11110111 -> 00001000
  • 01000000 -> 00111111

[EDIT]

The type could also be larger than uint8_t; it could be uint32_t, uint64_t or __uint128_t. I just use uint8_t because it's the easiest size to show in the example cases.

In general I expect that most solutions will have roughly this form:

  1. Compute the mask of bits that need to be flipped
  2. XOR by that mask

As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:

  • Find the 1-based position p of the most significant 1, by counting leading zeroes (_lzcnt_u64) and subtracting that from 64 (or 32, whichever is appropriate).
  • Create a mask with p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.

There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi is a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).

Putting it together:

x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))

Which could be further simplified to

_bzhi_u64(~x,  64 - _lzcnt_u64(x))

As shown by Peter. This doesn't follow the original 2-step plan; rather, all bits are flipped, and then the bits that were originally leading zeroes are reset.
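As a concrete (hedged) illustration, that expression could be wrapped in a complete function like this; the function name is mine, not from the answer, and it requires BMI2 and LZCNT (e.g. compile with -march=haswell):

#include <stdint.h>
#include <immintrin.h>

uint64_t flip_up_to_msb(uint64_t x)
{
    // ~x flips every bit; bzhi then clears everything above the original
    // most significant 1. _lzcnt_u64(0) == 64, so x == 0 maps to 0.
    return _bzhi_u64(~x, 64 - _lzcnt_u64(x));
}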

Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 2^64, putting the set bit just beyond the top of the number). Unfortunately the power of two that we need is a bit annoying to compute; at least I could not come up with a good way to do it, so it seems like a dead end to me.

Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:

// Get all-ones below the leading 1
// On x86-64, this is probably slower than Paul R's method using BSR and shift
//   even though you have to special case x==0
m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;  // last step should be removed if x is 32-bit

AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes the fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.

This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.
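For scalar use, here's a minimal self-contained sketch (my own wrapper, not from the answer) that finishes the shift/OR mask above with the final XOR, shown for uint64_t:

#include <stdint.h>

// Portable: no intrinsics, and no special case needed for x == 0 (the mask is 0 then).
uint64_t flip_below_msb_portable(uint64_t x)
{
    uint64_t m = x | (x >> 1);   // smear the highest set bit down into all lower positions
    m |= m >> 2;
    m |= m >> 4;
    m |= m >> 8;
    m |= m >> 16;
    m |= m >> 32;                // remove this step for 32-bit types
    return x ^ m;                // flip everything from bit 0 up to the leading 1
}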

TL:DR: use a uint64_t shift to implement this efficiently for uint32_t when compiling for 64-bit machines that have lzcnt (AMD since K10, Intel since Haswell). Without lzcnt (only bsr, which is baseline for x86), the n==0 case is still special.


For the uint64_t version, the hard part is that you have 65 different possible positions for the highest set bit, including non-existent (lzcnt producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count like foo >> (c&63).

Using a shift requires special-casing one leading-bit-position, typically the n==0 case. As Harold's answer shows, BMI2 bzhi avoids that, allowing bit counts from 0..64.

Same for 32-bit operand-size shifts: they mask the count with c&31. But to generate a mask for uint32_t, we can use a 64-bit shift efficiently on x86-64. (Or a 32-bit shift for uint16_t and uint8_t. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand-size is efficient, so there's no need to mess with partial-register writes.)

This strategy is even more efficient than bzhi for a type narrower than register width.

// optimized for 64-bit mode, otherwise 32-bit bzhi or a cmov version of Paul R's is good

#ifdef __LZCNT__
#include <stdint.h>
#include <immintrin.h>
uint32_t flip_32_on_64(uint32_t n)
{
    uint64_t mask32 = 0xffffffff;  // (uint64_t)(uint32_t)-1u
    // this needs to be _lzcnt_u32, not __builtin_clz; we need 32 for n==0
    // If lzcnt isn't available, we can't avoid handling n==0 specially
    uint32_t mask = mask32 >> _lzcnt_u32(n);
    return n ^ mask;
}
#endif

This works equivalently for uint8_t and uint16_t (literally the same code with the same mask, using a 32-bit lzcnt on them after zero-extension). But not for uint64_t. (You could use an unsigned __int128 shift, but shrd masks its shift count mod 64, so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or sbb same,same to generate a 0 or -1 in a register as the mask to be shifted.)
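For instance, a hypothetical wrapper of mine showing the uint8_t case reusing the function above unchanged:

// Hypothetical wrapper (my naming): a zero-extended uint8_t has 24..32
// leading zeros, so the mask computed by flip_32_on_64 already fits in the
// low 8 bits and the result truncates safely.
uint8_t flip_8(uint8_t n)
{
    return (uint8_t)flip_32_on_64(n);   // implicit conversion zero-extends n
}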

Godbolt with gcc and clang. Note that it's not safe to replace _lzcnt_u32 with __builtin_clz; clang 11 and later assume that it can't produce 32 even when they compile it to an lzcnt instruction (see footnote 1), and optimize the shift operand-size down to 32, which will act as mask32 >> (clz(n) & 31).

# clang 14 -O3 -march=haswell  (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
        lzcnt   eax, edi           # skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt.  Clang doesn't care, it's reckless about false deps except inside a loop in a single function.
        mov     ecx, 4294967295
        shrx    rax, rcx, rax
        xor     eax, edi
        ret

Without BMI2, e.g. with -march=bdver1 or barcelona (aka K10), we get the same code-gen except with shr rax, cl. Those CPUs do still have lzcnt, otherwise this wouldn't compile.

(I'm curious whether Intel Skylake Pentium/Celeron run lzcnt as lzcnt or bsf. They lack BMI1/BMI2, but lzcnt has its own feature flag. It seems low-power uarches as recent as Tremont are missing lzcnt, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but InstLat does have those available if someone wants to check.)

Anyway, bzhi also requires BMI2, so if you're comparing against that for any size but uint64_t, this is the comparison.

This shrx version can keep its -1 constant around in a register across loops, so the mov reg,-1 can be hoisted out of a loop after inlining if the compiler has a spare register. The best bzhi strategy doesn't need a mask constant, so it has nothing to gain. _bzhi_u64(~x, 64 - _lzcnt_u64(x)) is 5 uops, but works for 64-bit integers on 64-bit machines. Its critical-path latency is the same as this one's (lzcnt / sub / bzhi).


Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use -1 << bsr(n) to XOR some of the bits back to their original state. This could reduce critical-path latency. IDK if a C compiler could be coaxed into emitting this, especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged when the source is zero; only AMD documents this fact (Intel says it's an "undefined" result).

(TODO: finish this hand-written asm idea.)


Other C ideas for the uint64_t case: cmov or cmp/sbb (to generate a 0 or -1) in parallel with lzcnt, to shorten the critical path latency? See the Godbolt link where I was playing with that.
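One possible rendering of that idea in C (my own sketch, not necessarily what's on the Godbolt link) would be to compute a 0/-1 value independently of the lzcnt, then shift it:

#ifdef __LZCNT__
#include <stdint.h>
#include <immintrin.h>

// Sketch only: the ternary is intended to become cmov (or cmp/sbb) rather
// than a branch, and it can execute in parallel with the lzcnt.
uint64_t flip_u64_cmov(uint64_t x)
{
    uint64_t all = (x != 0) ? ~0ULL : 0;    // 0 or -1, independent of the lzcnt result
    unsigned k = (unsigned)_lzcnt_u64(x);   // 0..64
    uint64_t mask = all >> (k & 63);        // k==64 only when all==0, so masking the count is harmless
    return x ^ mask;
}
#endif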

ARM/AArch64 saturate their shift counts, unlike how x86 masks them for scalar shifts. If one could take advantage of that safely (without C shift-count UB), that would be neat, allowing something about as good as this.

x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using vlzcnt and a variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; it's only useful if you have multiple elements to do.)

Footnote 1: clang codegen with __builtin_clz or __builtin_clzll

Using __builtin_clzll(n) will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without lzcnt: the 63-bsr a compiler would use without lzcnt available would not produce the 64 we need for that case. Not unless you did n<<=1; / n|=1; or something before the bsr and adjusted the result, but that would be slower than cmov.

If you were using a 64-bit lzcnt, you'd want uint64_t mask = -1ULL, since there will be 32 extra leading zeros after zero-extending to uint64_t. Fortunately all-ones is relatively cheap to materialize on all ISAs, so use that instead of 0xffffffff00000000ULL.

Here's a simple example for 32-bit ints that works with gcc and compatible compilers (clang et al.), and is portable across most architectures.

uint32_t flip(uint32_t n)
{
    if (n == 0) return 0;
    uint32_t mask = ~0U >> __builtin_clz(n);
    return n ^ mask;
}

DEMO
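As a quick illustration (my own hypothetical harness, assuming the flip function above is in scope), the question's example values all check out:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    assert(flip(0x1Du) == 0x02u);  // 00011101 -> 00000010
    assert(flip(0x00u) == 0x00u);  // 00000000 -> 00000000
    assert(flip(0xFFu) == 0x00u);  // 11111111 -> 00000000
    assert(flip(0xF7u) == 0x08u);  // 11110111 -> 00001000
    assert(flip(0x40u) == 0x3Fu);  // 01000000 -> 00111111
    puts("all cases pass");
    return 0;
}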

We could avoid the extra check for n==0 if we used lzcnt on x86-64 (or clz on ARM) and a shift that allowed a count of 32. (In C, shifts by the type width or larger are undefined behaviour. On x86, in practice the shift count is masked with &31 for shifts other than 64-bit, so this could be usable for uint16_t or uint8_t using a uint32_t mask.)

Be careful to avoid C undefined behaviour, including any assumption about __builtin_clz with an input of 0; modern C compilers are not portable assemblers, even though we sometimes wish they were when the language doesn't portably expose the CPU features we want to take advantage of. For example, clang assumes that __builtin_clz(n) can't be 32 even when it compiles it to lzcnt.

See @PeterCordes's answer for details.

If your use case is performance-critical you might also want to consider a SIMD implementation for performing the bit-flipping operation on a large number of elements. Here's an example using AVX-512 for 32-bit elements:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void flip(const uint32_t in[], uint32_t out[], size_t n)
{
    assert((n & 15) == 0); // for this example we only handle arrays which are whole vector multiples in size
    for (size_t i = 0; i + 16 <= n; i += 16)  // 16 x 32-bit elements per 512-bit vector
    {
        __m512i vin = _mm512_loadu_si512(&in[i]);
        __m512i vlz = _mm512_lzcnt_epi32(vin);
        __m512i vmask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), vlz);
        __m512i vout = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}

This uses the same approach as the other solutions, i.e. count leading zeroes, create mask, XOR, but for 32-bit elements it processes 16 elements per loop iteration. You could implement a 64-bit version of this similarly (see the sketch below), but unfortunately there are no similar AVX-512 intrinsics for element sizes < 32 bits or > 64 bits.
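As a rough sketch of that 64-bit-element variant (my own adaptation, not part of the answer), the same three steps map directly onto the epi64 intrinsics, processing 8 elements per iteration:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

// Assumes AVX-512F + AVX-512CD, like the 32-bit version above.
void flip64(const uint64_t in[], uint64_t out[], size_t n)
{
    assert((n & 7) == 0); // whole vectors only, as in the 32-bit example
    for (size_t i = 0; i + 8 <= n; i += 8)
    {
        __m512i vin   = _mm512_loadu_si512(&in[i]);
        __m512i vlz   = _mm512_lzcnt_epi64(vin);                        // per-element leading-zero count
        __m512i vmask = _mm512_srlv_epi64(_mm512_set1_epi64(-1), vlz);  // count of 64 (element == 0) gives a zero mask
        __m512i vout  = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}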

You can see the above 32-bit example in action on Compiler Explorer (note: you might need to hit the refresh button at the bottom of the assembly pane to get it to re-compile and run if you get "Program returned: 139" in the output pane - this seems to be due to a glitch in Compiler Explorer currently).
