Fast implementation of a large integer counter (in C/C++)

My goal is the following:

Generate successive values such that each new value has never been generated before, until all possible values have been generated. At that point, the counter starts the same sequence again. The main point is that all possible values are generated without repetition (until the period is exhausted). It does not matter whether the sequence is simply 0, 1, 2, 3, ... or follows some other order.

For example, if the range can be represented simply by an unsigned, then

void increment (unsigned &n) {++n;}

is enough. However, the integer range is larger than 64 bits. For example, in one place I need to generate a 256-bit sequence. A simple implementation looks like the following, just to illustrate what I am trying to do:

typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
    if (ctr[0] < max) {++ctr[0]; return;}
    if (ctr[1] < max) {++ctr[1]; return;}
    if (ctr[2] < max) {++ctr[2]; return;}
    if (ctr[3] < max) {++ctr[3]; return;}
    ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}

So if ctr starts with all zeros, ctr[0] is first increased one by one until it reaches max, then ctr[1], and so on. Once all 256 bits are set, we reset it to all zeros and start again.

The problem is that this implementation is surprisingly slow. My current improved version is roughly equivalent to the following:

void increment (ctr_type &ctr)
{
    std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
    if (k < 4)
        ++ctr[k];
    else
        memset(ctr.data(), 0, 32);

}

If the counter is only manipulated with the above increment function and always starts at zero, then ctr[k] == 0 if ctr[k - 1] == 0. Thus the value k will be the index of the first element that is less than the maximum.

I expected the first version to be faster, since a branch misprediction should happen only once in every 2^64 iterations. In the second version a misprediction happens only once in every 2^256 iterations, but that should not make a difference. Apart from the branch, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first version.

However, clang, gcc, and Intel icpc all generate binaries in which the second version is much faster.

My main question is whether anyone knows a faster way to implement such a counter. It does not matter whether the counter starts by increasing the first integer, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of 256 bits.

To make things more complicated, I also need non-unit increments. For example, each time the counter is incremented by K, where K > 1 but almost always remains constant. My current implementation is similar to the above.

To provide some more context, one place I use the counters is as input to the AES-NI aesenc instructions. Each distinct 128-bit integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instruction, produces a distinct 128-bit integer. If I generate one __m128i integer at a time, then the cost of increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalently to the following:

block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);

Each time I need new output, I increment each block[i] by 4 and generate 4 __m128i outputs at once. By interleaving instructions, I was able to increase the throughput overall, reducing the cycles per byte of output (cpB) from 6 to 0.9 when using two 64-bit integers as the counter and 8 blocks. However, if I instead use four 32-bit integers as the counter, the throughput, measured in bytes per second, is cut in half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit integers in some situations, but I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and increment is indeed what slows the program down. Since loading into __m128i and storing the __m128i output into usable 32-bit or 64-bit integers are both done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected that the AES-NI instructions, after loading the integers into __m128i, would dominate the performance. But when using 4 or 8 blocks, that was clearly not the case.

So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.

It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes. And that would require a Gray code counter.

The next thing to do before optimizing is to fix the original implementation:

void increment (ctr_type &ctr)
{
    if (++ctr[0] != 0) return;
    if (++ctr[1] != 0) return;
    if (++ctr[2] != 0) return;
    ++ctr[3];
}

If each ctr[i] were not allowed to overflow to zero, the period would be only about 4*(2^64), as in 0-9, 19, 29, 39, 49, ..., 99, 199, 299, ..., and 1999, 2999, 3999, ..., 9999.

In reply to the comment: it takes 2^64 iterations to reach the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to produce the first carry. That's about 136 years.
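
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the rate of 2^32 increments per second is the answer's own generous assumption, and the function names are made up for illustration.

```cpp
// Back-of-the-envelope check of the "about 136 years" estimate.
// Assumption: a sustained rate of 2^32 increments per second.
constexpr double increments_to_first_carry = 18446744073709551616.0;  // 2^64
constexpr double increments_per_second     = 4294967296.0;            // 2^32
constexpr double seconds_per_year          = 365.25 * 86400.0;        // Julian year

// Seconds until the lowest 64-bit word wraps for the first time.
constexpr double seconds_to_first_carry() {
    return increments_to_first_carry / increments_per_second;
}

// The same duration expressed in years.
constexpr double years_to_first_carry() {
    return seconds_to_first_carry() / seconds_per_year;
}
```

Both functions are constexpr, so the result can be inspected at compile time or printed from any small driver program.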

EDIT

If the original implementation with 2^66 states is really what is wanted, then I'd suggest changing the interface and the functionality to something like:

  (*counter) += 1;
  while (*counter == 0)
  {
     counter++;  // Move to next word
     if (counter > tail_of_array) {
        counter = head_of_array;
        memset(counter,0, 16);
        break;
     }
  }

The point is that overflow is still very infrequent. Almost always there is just one word to increment.

If you're using GCC or another compiler with __int128, such as Clang or ICC:

unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;

On systems where __int128 isn't available:

std::array<uint64_t, 4> c{};
c[0]++;
if (c[0] == 0)
{
    c[1]++;
    if (c[1] == 0)
    {
        c[2]++;
        if (c[2] == 0)
        {
            c[3]++;
        }
    }
}

In inline assembly it's much easier to do this using the carry flag. Unfortunately, most high-level languages have no means of accessing it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
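
As a minimal sketch of the builtin approach, assuming GCC or Clang: the type-generic __builtin_add_overflow stores the wrapped sum and reports whether the addition overflowed, which is enough to propagate a carry through the four words. The function name add and the choice of ctr[0] as the least significant word are assumptions made for this example. It also covers the question's case of incrementing by a constant K.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using ctr_type = std::array<std::uint64_t, 4>;  // ctr[0] is the least significant word

// Add k to the 256-bit counter, wrapping modulo 2^256.
// __builtin_add_overflow is a GCC/Clang extension, not standard C++.
inline void add(ctr_type &ctr, std::uint64_t k) {
    std::uint64_t carry = k;
    for (std::size_t i = 0; i < ctr.size() && carry != 0; ++i) {
        // Stores the wrapped sum in ctr[i]; returns true if the addition overflowed.
        carry = __builtin_add_overflow(ctr[i], carry, &ctr[i]) ? 1 : 0;
    }
}
```

The loop exits as soon as there is no carry left, so in the common case only ctr[0] is touched.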

Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80, and you cannot even count through a 64-bit counter in your lifetime.

Neither of your counter versions increments correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not clear any of the words that have reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.
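
The difference is easy to see on a scaled-down counter. The sketch below is an illustration rather than the question's exact code: it assumes two 8-bit words instead of four 64-bit ones, so both periods can be enumerated. The saturating logic returns to zero after 2*255 + 1 = 511 calls, while the carry-propagating version needs all 2^16 = 65536.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using word = std::uint8_t;              // tiny word, so the periods can be enumerated
using small_ctr = std::array<word, 2>;
constexpr word word_max = 0xFF;

// Same logic as the question's first version: a word stops once it reaches its maximum.
inline void increment_saturating(small_ctr &c) {
    for (word &w : c) {
        if (w < word_max) { ++w; return; }
    }
    c.fill(0);
}

// Carry-propagating version: a word that wraps to zero carries into the next one.
inline void increment_with_carry(small_ctr &c) {
    for (word &w : c) {
        if (++w != 0) return;
    }
}

// Number of calls needed to get from all zeros back to all zeros.
template <typename Increment>
std::size_t period(Increment inc) {
    small_ctr c{};
    std::size_t steps = 0;
    do { inc(c); ++steps; } while (c != small_ctr{});
    return steps;
}
```

Scaling the same reasoning up to four 64-bit words gives a period of 4*(2^64 - 1) + 1 for the saturating version, far short of 2^256.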

You mention "Generate successive values, such that each new one was never generated before".

To generate a set of such values, look at linear congruential generators:

  • the sequence x = (x*1 + 1) % (power_of_2): as you already considered, these are simply sequential numbers.

  • the sequence x = (x*13 + 137) % (power_of_2): this generates unique numbers with a predictable period of power_of_2 (the full period, since 137 is odd and 13 - 1 is divisible by 4), and the numbers look more "random", in a pseudo-random way. You need to resort to arbitrary-precision arithmetic to get it working, along with all the tricks for multiplication by constants. This gives you a nice place to start.
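
Whether a given multiplier and increment visit every value before repeating can be checked by brute force on a small modulus. The following is only a sketch under that assumption (a modulus of 2^bits with bits below 32); the helper name has_full_period is made up for this example, and a 256-bit version would need multi-word arithmetic instead.

```cpp
#include <cstdint>
#include <vector>

// Returns true if x -> (a*x + c) mod 2^bits visits every value exactly once
// before returning to the starting value 0. Requires bits < 32.
inline bool has_full_period(std::uint32_t a, std::uint32_t c, unsigned bits) {
    const std::uint32_t modulus = 1u << bits;
    const std::uint32_t mask = modulus - 1;
    std::vector<bool> seen(modulus, false);
    std::uint32_t x = 0;
    for (std::uint32_t i = 0; i < modulus; ++i) {
        if (seen[x]) return false;   // repeated a value too early
        seen[x] = true;
        x = (a * x + c) & mask;      // unsigned wraparound is harmless before masking
    }
    return x == 0;                   // back at the start after exactly `modulus` steps
}
```

For example, has_full_period(13, 137, 16) holds, while an even increment such as 136 or a multiplier such as 3 (where a - 1 is not divisible by 4) fails the check.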

You also complain that your simple code is "slow".

At a 4.2 GHz clock frequency, running 4 instructions per cycle and using AVX512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing but increments, you get only 64 x 8 x 4 x 2^32 = 8796093022208 increments per second, so 2^64 increments are reached in about 25 days. This post is old; you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.

You also require "until all possible values are generated".

You can only generate a finite number of values, depending on how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128-bit or 256-bit numbers, even if you were the richest man in the world, you would never wrap around before the first of these conditions occurs:

  • running out of money
  • the end of humankind (nobody will get the outcome of your software)
  • burning the energy from the last particles of the universe

Multi-word addition can easily be accomplished in a portable fashion by using three macros that mimic three types of addition instructions found on many processors:

ADDcc adds two words, and sets the carry if there was unsigned overflow
ADDC adds two words plus carry (from a previous addition)
ADDCcc adds two words plus carry, and sets the carry if there was unsigned overflow

A multi-word addition with two words uses ADDcc for the least significant words followed by ADDC for the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.

The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use a struct, for example. Use of a struct will be significantly faster if each operand comprises only a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question, that would likely be a 256-bit counter comprising four uint64_t components.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

#define ADDCcc(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)

#define ADDcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)

#define ADDC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t0+t1)

typedef uint16_t T;

/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
    T cy, t0, t1;
    counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
    for (int i = 1; i < (n - 1); i++) {
        counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
    }
    counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}

#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)

int main (void)
{
    uint32_t count32 = 0, incr32 = INCREMENT;
    T count_arr2 [UINT32_ARRAY_LEN] = {0};
    T incr_arr2  [UINT32_ARRAY_LEN] = {INCREMENT};
    do {
        count32 = count32 + incr32;
        inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
    } while (count32 < (0U - INCREMENT - 1));
    printf ("count32 = %08x  arr_count = %08x\n", 
            count32, (((uint32_t)count_arr2 [1] << 16) +
                      ((uint32_t)count_arr2 [0] <<  0)));

    uint64_t count64 = 0, incr64 = INCREMENT;
    T count_arr4 [UINT64_ARRAY_LEN] = {0};
    T incr_arr4  [UINT64_ARRAY_LEN] = {INCREMENT};
    do {
        count64 = count64 + incr64;
        inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
    } while (count64 < 0xa987654321ULL);
    printf ("count64 = %016llx  arr_count = %016llx\n", 
            count64, (((uint64_t)count_arr4 [3] << 48) + 
                      ((uint64_t)count_arr4 [2] << 32) +
                      ((uint64_t)count_arr4 [1] << 16) +
                      ((uint64_t)count_arr4 [0] <<  0)));
    return EXIT_SUCCESS;
}

Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like this:

count32 = fffffffa  arr_count = fffffffa
count64 = 000000a987654326  arr_count = 000000a987654326

Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
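
Following the suggestion above to use a struct and the widest available word, the question's 256-bit counter might be sketched as follows. This is an illustration under stated assumptions, not a benchmarked implementation: the type counter256, its member names, and the function add are hypothetical, and the increment is assumed to fit in 64 bits. The carry is detected the same way as in the macros, by checking whether an unsigned sum came out smaller than an operand.

```cpp
#include <cstdint>

// 256-bit counter as a struct of four 64-bit words; w0 is the least significant.
struct counter256 {
    std::uint64_t w0 = 0, w1 = 0, w2 = 0, w3 = 0;
};

// Add a 64-bit increment k, wrapping modulo 2^256.
// An unsigned sum that is smaller than one of its operands must have wrapped.
inline void add(counter256 &c, std::uint64_t k) {
    c.w0 += k;
    std::uint64_t carry = c.w0 < k;   // 1 if w0 wrapped
    c.w1 += carry;
    carry = c.w1 < carry;             // 1 only if w1 wrapped from max to 0
    c.w2 += carry;
    carry = c.w2 < carry;
    c.w3 += carry;
}
```

The function is branch-free, which avoids mispredictions entirely at the cost of always touching all four words.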
