
Why does this piece of code written using uint8_t run faster than analogous code written with uint32_t or uint64_t on a 64bit machine?

Isn't it common knowledge that, on 64-bit systems, math operations run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion? Yet while testing my bitset implementation (where the majority of the time is spent on bitwise operations), I found I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.

8-bit:

#define mod8(x) x&7
#define div8(x) x>>3

template<unsigned long bits>
struct bitset{
private:
    uint8_t fill[8] = {};
    uint8_t clear[8];
    uint8_t band[(bits/8)+1] = {};

public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div8(ind)]&fill[mod8(ind)];
    }

    template<typename T>
    inline void store_high(const T ind){
        band[div8(ind)] |= fill[mod8(ind)];
    }


    template<typename T>
    inline void store_low(const T ind){
        band[div8(ind)] &= clear[mod8(ind)];

    }
    bitset(){
        for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val*=2;
        }
    }
};

32-bit:

#define mod32(x) x&31
#define div32(x) x>>5

template<unsigned long bits>
struct bitset{
private:
    uint32_t fill[32] = {};
    uint32_t clear[32];
    uint32_t band[(bits/32)+1] = {};

public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div32(ind)]&fill[mod32(ind)];
    }

    template<typename T>
    inline void store_high(const T ind){
        band[div32(ind)] |= fill[mod32(ind)];
    }


    template<typename T>
    inline void store_low(const T ind){
        band[div32(ind)] &= clear[mod32(ind)];

    }
    bitset(){
        for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val*=2;
        }
    }
};

And here is the benchmark I used (it just moves a single 1 from position 0 to the end iteratively):

const int len = 1000000;   
bitset<len> bs;

    {
        auto start = std::chrono::high_resolution_clock::now();
        bs.store_high(0);
        for (int ii = 1; ii < len; ++ii) {
            bs.store_high(ii);
            bs.store_low(ii-1);
        }
        auto stop = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count()<<std::endl;
    }

Isn't it common knowledge that, on 64-bit systems, math operations run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion?

This isn't a universal truth. As always, it depends on the details.

Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?

The title doesn't match the question. There are no such types as uint_X and you aren't using uintX_t. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bits wide and that is deemed by the language implementers to provide the fastest operations.

If we were to take your earlier-mentioned assumption for granted, then it should logically follow that the language implementers would have chosen to use a 32/64-bit type as uint_fast8_t. That said, you cannot assume that they have done so, and whatever generic measurement (if any) was used to make that choice doesn't necessarily apply to your case.
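
A quick way to see what these aliases actually are on a given implementation (a hedged illustration, not part of the original answer):

#include <cstdint>
#include <cstdio>

int main(){
    // These sizes are an implementation choice. On x86-64 glibc, uint_fast8_t
    // is typically 1 byte while uint_fast32_t and uint_fast64_t are typically 8.
    std::printf("uint_fast8_t  : %zu bytes\n", sizeof(std::uint_fast8_t));
    std::printf("uint_fast32_t : %zu bytes\n", sizeof(std::uint_fast32_t));
    std::printf("uint_fast64_t : %zu bytes\n", sizeof(std::uint_fast64_t));
}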

That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative calculation speeds of potentially different integer types:

uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};

uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};

Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.

TL;DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.

Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.


One possible reason is that you iterate linearly over bits, so all the operations within the same band[] element form one long dependency chain of &= and |= operations, plus store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).

For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.

Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction-count are too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka Scheduler). With 64 operations probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism for loading (or better, calculating) 1U<<(n&31) or rotl(-1U, n&31) masks that can "use up" some of the wasted execution slots in the pipeline.

But for uint8_t band[], you're moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
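
To make the difference concrete, here is a small standalone sketch (not from the original answer) that prints which band[] element each bit index touches for each bucket width:

#include <cstdio>

int main(){
    // The benchmark sweeps bit indices linearly. A uint32_t band[] is indexed
    // with ii>>5, a uint8_t band[] with ii>>3: 32 consecutive indices hit the
    // same 32-bit word (2 x 32 = 64 dependent RMWs from the store_high/store_low
    // pairs), but only 8 consecutive indices hit the same byte (2 x 8 = 16).
    for (int ii = 0; ii < 40; ++ii)
        std::printf("ii=%2d  32-bit elem=%d  8-bit elem=%d\n", ii, ii >> 5, ii >> 3);
}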

See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)

See also Modern Microprocessors: A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.


Small vs. large buckets

Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing, or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)

C++ unfortunately doesn't make it easy in general to use different-sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.

Although unsigned char can always alias anything else, so you could use that for your single-bit accesses, only using uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64-bit ISAs) for init or whatever. Or in this case, uint_fast32_t being a 64-bit type on many x86-64 C++ implementations would make it useful for this, unlike usual when that sucks, wasting cache footprint when you're only using the value-range of a 32-bit number and being slower for non-constant division on some CPUs.
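
For example, here is one hedged way (not from the answer's code) to do wider scans over a byte array without running into strict-aliasing problems: use memcpy for the wide loads, which compilers typically turn into a single 8-byte load:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns the index of the first non-zero byte in buf[0..n), or n if none.
std::size_t first_nonzero_byte(const unsigned char* buf, std::size_t n){
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= n; i += sizeof(std::uint64_t)){
        std::uint64_t word;
        std::memcpy(&word, buf + i, sizeof word);   // well-defined "type punning"
        if (word != 0) break;                       // some byte in this word is non-zero
    }
    for (; i < n; ++i)                              // locate it (or finish the tail) byte by byte
        if (buf[i] != 0) return i;
    return n;
}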

On x86 CPUs, a byte store is naturally fully efficient, but even on an ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?) And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if narrower. Especially important on lower-end CPUs with smaller out-of-order scheduler buffers.

(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it, which can occasionally hurt.)

(Different-sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)


Code review:

Storing an array of masks is probably worse than just computing 1U<<(n&31) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask instead of indexing memory in a loop that already does other memory operations.

(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart: xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3 uops on Intel CPUs, vs. 1 on AMD.)

Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn (gcc -march=haswell).
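
For instance, a hedged sketch of the 8-bit store functions that computes the mask as needed and NOTs it on the fly, so neither fill[] nor clear[] has to exist (with BMI1, e.g. gcc -march=haswell, the &= ~mask can compile to andn):

template<typename T>
inline void store_high(const T ind){
    band[ind >> 3] |= uint8_t(1u << (ind & 7));      // mask computed, not loaded from fill[]
}

template<typename T>
inline void store_low(const T ind){
    band[ind >> 3] &= uint8_t(~(1u << (ind & 7)));   // ~mask on the fly, no clear[] array
}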

Also, your macros are unsafe: wrap the expression in () so operator-precedence doesn't bite you if you use foo[div8(x) - 1]. As in #define div8(x) (x>>3).

But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define shift counts and masks as constants, e.g. static const int shift = 3;. In C++, do that inside the struct/class scope, and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
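
A minimal sketch of what that might look like for the 8-bit version (the constant names here are just illustrative, not from the answer):

#include <cstddef>
#include <cstdint>

template<unsigned long bits>
struct bitset{
private:
    static constexpr unsigned shift = 3;   // x >> shift  is  x / 8
    static constexpr unsigned mask  = 7;   // x & mask    is  x % 8
    std::uint8_t band[(bits >> shift) + 1] = {};

public:
    bool operator[](std::size_t idx) const {
        return band[idx >> shift] & (1u << (idx & mask));
    }
    // store_high / store_low as in the previous sketch, using idx >> shift and idx & mask
};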
