Decimal concatenation of two integers using bit operations

I want to concatenate two integers using only bit operations, since I need as much efficiency as possible. There are various answers available, but they are not fast enough. What I want is an implementation that uses only bit operations such as left shift, OR, and so on. Please guide me on how to do it.

example

int x=32;
int y=12;
int result=3212;

I have an FPGA implementation of AES. I need this on my system to reduce the time consumption of some tasks.

The most efficient way to do it is likely something similar to this:

uint32_t uintcat (uint32_t ms, uint32_t ls)
{
  uint32_t mult=1;

  do
  {
    mult *= 10; 
  } while(mult <= ls);

  return ms * mult + ls;
}

Then let the compiler worry about optimization. There's likely not a lot it can improve, since this is base 10, which doesn't get along well with the computer's instructions, such as shifting.
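One caveat with the loop above is that a 32-bit `mult` can wrap once `ls` reaches 10^9. A minimal sketch of a 64-bit variant follows, assuming a wider return type is acceptable. The name `uintcat64` is made up for illustration and is not part of the original answer.

```c
#include <stdint.h>

/* Hypothetical 64-bit variant of the loop above (illustrative name).
   With a 64-bit multiplier the loop terminates for every 32-bit ls,
   because 10^10 fits in uint64_t and is larger than UINT32_MAX. */
static uint64_t uintcat64(uint32_t ms, uint32_t ls)
{
    uint64_t mult = 1;

    do {
        mult *= 10;            /* always at least one digit, so ls == 0 appends "0" */
    } while (mult <= ls);

    /* The product can still wrap if ms * mult exceeds 2^64 - 1. */
    return (uint64_t)ms * mult + ls;
}
```

With this sketch, `uintcat64(32, 12)` gives 3212 and `uintcat64(65535, 65535)` gives 6553565535, which would not fit in 32 bits.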


EDIT: BENCHMARK TEST

Intel i7-3770 @ 3.4 GHz
OS: Windows 7/64
Mingw, GCC version 4.6.2
gcc -O3 -std=c99 -pedantic-errors -Wall

10 million random value pairs, each value from 0 to 32767, so results range from 0 to 3276732767.

Results (approximate):

Algorithm 1: 60287 micro seconds
Algorithm 2: 65185 micro seconds

Benchmark code used:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand, srand
#include <windows.h>
#include <time.h>

uint32_t uintcat (uint32_t ms, uint32_t ls)
{
  uint32_t mult=1;

  do
  {
    mult *= 10; 
  } while(mult <= ls);

  return ms * mult + ls;
}


uint32_t myConcat (uint32_t a, uint32_t b) {
    switch( (b >= 10000000) ? 7 : 
            (b >= 1000000) ? 6 : 
            (b >= 100000) ? 5 : 
            (b >= 10000) ? 4 : 
            (b >= 1000) ? 3 : 
            (b >= 100) ? 2 : 
            (b >= 10) ? 1 : 0 ) {
        case 1: return a*100+b; break;
        case 2: return a*1000+b; break;
        case 3: return a*10000+b; break;
        case 4: return a*100000+b; break;
        case 5: return a*1000000+b; break;
        case 6: return a*10000000+b; break;
        case 7: return a*100000000+b; break;

        default: return a*10+b; break;
    }
}


static LARGE_INTEGER freq;

static void print_benchmark_results (LARGE_INTEGER* start, LARGE_INTEGER* end)
{
  LARGE_INTEGER elapsed;

  elapsed.QuadPart = end->QuadPart - start->QuadPart;
  elapsed.QuadPart *= 1000000;
  elapsed.QuadPart /= freq.QuadPart;

  // QuadPart is a 64-bit LONGLONG; cast so the argument matches %lu
  printf("%lu micro seconds", (unsigned long)elapsed.QuadPart);
}

int main()
{
  const uint32_t TEST_N = 10000000;
  uint32_t* data1 = malloc (sizeof(uint32_t) * TEST_N);
  uint32_t* data2 = malloc (sizeof(uint32_t) * TEST_N);
  volatile uint32_t* result_algo1 = malloc (sizeof(uint32_t) * TEST_N);
  volatile uint32_t* result_algo2 = malloc (sizeof(uint32_t) * TEST_N);

  srand (time(NULL));
  // Mingw rand() apparently gives numbers up to 32767
  // worst case should therefore be 3,276,732,767

  // fill up random data in arrays
  for(uint32_t i=0; i<TEST_N; i++)
  {
    data1[i] = rand();
    data2[i] = rand();
  }


  QueryPerformanceFrequency(&freq); 


  LARGE_INTEGER start, end;

  // run algorithm 1
  QueryPerformanceCounter(&start);
  for(uint32_t i=0; i<TEST_N; i++)
  {
    result_algo1[i] = uintcat(data1[i], data2[i]);
  } 
  QueryPerformanceCounter(&end);

  // print results
  printf("Algorithm 1: ");
  print_benchmark_results(&start, &end);
  printf("\n");

  // run algorithm 2
  QueryPerformanceCounter(&start);
  for(uint32_t i=0; i<TEST_N; i++)
  {
    result_algo2[i] = myConcat(data1[i], data2[i]);
  } 
  QueryPerformanceCounter(&end);

  // print results
  printf("Algorithm 2: ");
  print_benchmark_results(&start, &end);
  printf("\n\n");


  // sanity check both algorithms against each other
  for(uint32_t i=0; i<TEST_N; i++)
  {
    if(result_algo1[i] != result_algo2[i])
    {
      printf("Results mismatch for %lu %lu. Expected: %lu%lu, algo1: %lu, algo2: %lu\n",
             data1[i], 
             data2[i],
             data1[i],
             data2[i],
             result_algo1[i],
             result_algo2[i]);
    }
  }


  // clean up
  free((void*)data1);
  free((void*)data2);
  free((void*)result_algo1);
  free((void*)result_algo2);
}

Bit operations use the binary representation of the numbers. However, what you are trying to achieve is to concatenate the numbers in decimal notation. Please note that concatenating the decimal representations has little to do with concatenating the binary representations. Though it is theoretically possible to solve the problem using binary operations, I am sure it will be far from the most efficient way.
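To make the difference concrete, here is a small sketch comparing the two kinds of concatenation. Both function names are invented for illustration, and both assume the inputs are small enough that the result fits in 32 bits.

```c
#include <stdint.h>

/* Binary concatenation: shift x left by the bit length of y, then OR y in.
   Assumes the result fits in 32 bits (illustrative helper, not from the answers). */
static uint32_t concat_binary(uint32_t x, uint32_t y)
{
    unsigned bits = 0;
    for (uint32_t t = y; t != 0; t >>= 1)
        bits++;                      /* bit length of y (0 for y == 0) */
    return (x << bits) | y;
}

/* Decimal concatenation: scale x by a power of ten instead of a power of two.
   Assumes y < 10^9 so that mult does not wrap. */
static uint32_t concat_decimal(uint32_t x, uint32_t y)
{
    uint32_t mult = 10;
    while (mult <= y)
        mult *= 10;
    return x * mult + y;
}
```

For x = 32 and y = 12, the binary version yields (32 << 4) | 12 = 524, while the decimal version yields 3212.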

We need to calculate a*10^N + b very fast.

Bit operations aren't the best way to optimize this (even with tricks like a := (a<<1) + (a<<3) <=> a := a*10, since the compiler can do that itself).
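The shift-and-add identity mentioned above can be written out as a short sketch. The function names are illustrative only. An optimizing compiler is generally free to pick whichever form is cheaper for the target, so the hand-written shifts are unlikely to help.

```c
#include <stdint.h>

/* a*10 written as a*2 + a*8 using shifts (illustrative name). */
static uint32_t times10_shift(uint32_t a)
{
    return (a << 1) + (a << 3);
}

/* The plain multiplication; unsigned arithmetic wraps modulo 2^32 in both forms. */
static uint32_t times10_mul(uint32_t a)
{
    return a * 10u;
}
```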

The first problem is calculating 10^N, but there is no need to calculate it: there are only 9 possible values.

The second problem is calculating N from b (the length of its base-10 representation). If your data is uniformly distributed, you can minimize the operation count in the average case.

Check b >= 10^9, b >= 10^8, ..., b >= 10 with ()?: (it's faster than if( ) after optimizations, and it has much simpler grammar and functionality), and call the result N. Next, make a switch(N) with lines "return a*10^N + b" (where 10^N is a constant). As far as I know, a switch() with 3-4 cases is faster than the equivalent if( ) construction after optimizations.

unsigned int myConcat(unsigned int& a, unsigned int& b) {
    switch( (b >= 10000000) ? 7 : 
            (b >= 1000000) ? 6 : 
            (b >= 100000) ? 5 : 
            (b >= 10000) ? 4 : 
            (b >= 1000) ? 3 : 
            (b >= 100) ? 2 : 
            (b >= 10) ? 1 : 0 ) {
        case 1: return a*100+b; break;
        case 2: return a*1000+b; break;
        case 3: return a*10000+b; break;
        case 4: return a*100000+b; break;
        case 5: return a*1000000+b; break;
        case 6: return a*10000000+b; break;
        case 7: return a*100000000+b; break;
        default: return a*10+b; break;
        // I don't really know what to do here
        //case 8: return a*1000*1000*1000+b; break;
        //case 9: return a*10*1000*1000*1000+b; break;
    }
}

As you can see, there are 2-3 operations in the average case, and optimization is very effective here. I've benchmarked it against Lundin's suggestion; here is the result: 0ms vs 100ms.
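The commented-out 9- and 10-digit cases in myConcat cannot be represented in a 32-bit result. One possible way to cover them, assuming a 64-bit return type is acceptable, is to keep the ?: chain for the digit count and look the power of 10 up in a table. The names `decimal_digits`, `pow10_table` and `concat_table` are hypothetical.

```c
#include <stdint.h>

/* 10^0 .. 10^10; the last entry needs 64 bits. */
static const uint64_t pow10_table[11] = {
    1ULL, 10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL, 1000000ULL,
    10000000ULL, 100000000ULL, 1000000000ULL, 10000000000ULL
};

/* Number of decimal digits in b; 0 is treated as one digit. */
static unsigned decimal_digits(uint32_t b)
{
    return (b >= 1000000000u) ? 10 :
           (b >= 100000000u)  ? 9 :
           (b >= 10000000u)   ? 8 :
           (b >= 1000000u)    ? 7 :
           (b >= 100000u)     ? 6 :
           (b >= 10000u)      ? 5 :
           (b >= 1000u)       ? 4 :
           (b >= 100u)        ? 3 :
           (b >= 10u)         ? 2 : 1;
}

/* a followed by the digits of b; can still wrap if a * 10^N exceeds 2^64 - 1. */
static uint64_t concat_table(uint32_t a, uint32_t b)
{
    return (uint64_t)a * pow10_table[decimal_digits(b)] + b;
}
```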

If you care about decimal digit concatenation, you might want to simply do that as you're printing, and convert both numbers to a sequence of digits sequentially. E.g. "How do I print an integer in Assembly Level Programming without printf from the c library?" shows an efficient C function, as well as asm. Call it twice into the same buffer.
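A minimal sketch of the "call it twice into the same buffer" idea is shown below. It is not the function from the linked question; `put_u32` and `concat_to_string` are illustrative names, and the caller is assumed to supply a buffer of at least 21 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of v at dst (no terminator) and returns the count. */
static size_t put_u32(char *dst, uint32_t v)
{
    char tmp[10];                 /* UINT32_MAX has 10 decimal digits */
    size_t n = 0;

    do {
        tmp[n++] = (char)('0' + v % 10);   /* least-significant digit first */
        v /= 10;
    } while (v != 0);

    for (size_t i = 0; i < n; i++)
        dst[i] = tmp[n - 1 - i];  /* reverse into the output */
    return n;
}

/* buf must hold at least 21 bytes: 10 + 10 digits and a terminator. */
static size_t concat_to_string(char *buf, uint32_t ms, uint32_t ls)
{
    size_t len = put_u32(buf, ms);
    len += put_u32(buf + len, ls);   /* second number continues where the first ended */
    buf[len] = '\0';
    return len;
}
```

This avoids computing the power of 10 at all, at the cost of one division per digit, so it only pays off when the final goal is text output anyway.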


@Lundin's answer loops over increasing powers of 10 to find the right decimal shift, i.e. a linear search for the right power of 10. If it's called very frequently, so a lookup table can stay hot in cache, a speedup may be possible.

If you can use GNU C __builtin_clz (Count Leading Zeros) or some other way of quickly finding the MSB position of the right-hand input (ls, the least-significant part of the resulting concatenation), you can start the search for the right mult from a 32-entry lookup table. (And you only have to check at most one more iteration, so it's not a loop.)

Most of the common modern CPU architectures have a HW instruction that the compiler can use directly or with a bit of processing to implement clz: https://en.wikipedia.org/wiki/Find_first_set#Hardware_support. (And on all but x86, the result is well-defined for an input of 0, but unfortunately GNU C doesn't portably give us access to that.)

If the table stays hot in L1d cache, this can be pretty good. The extra latency of a clz and a table lookup is comparable to a couple of iterations of the loop (on a modern x86 like Skylake or Ryzen, for example, where bsf or tzcnt has 3-cycle latency, L1d latency is 4 or 5 cycles, and imul latency is 3 cycles).

Of course, on many architectures (including x86), multiplying by 10 is cheaper than multiplying by a runtime variable, using shift and add: 2 LEA instructions on x86, or an add + lsl on ARM/AArch64 using a shifted input to do tmp = x + x*4 with the add. So on Intel CPUs, we're only looking at a 2-cycle loop-carried dependency chain, not 3. But AMD CPUs have slower LEA when using a scaled index.

This doesn't sound great for small numbers. But it can reduce branch mispredictions by needing at most one iteration. It even makes a branchless implementation possible. And it means less total work for large lower parts (large powers of 10). But large integers will easily overflow unless you use a wider result type.


Unfortunately, 10 is not a power of 2, so the MSB position alone can't give us the exact power of 10 to multiply by. E.g. numbers from 64 to 127 all have MSB = 1<<6, but some of them have 2 decimal digits and some have 3. Since we want to avoid division (because it requires a multiplication by a magic constant and shifting the high half), we want to always start with the lower power of 10 and see if that's big enough.

But fortunately, a bitscan does get us within one power of 10, so we no longer need a loop.
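As a rough sketch of this idea, the starting powers of 10 can be generated at startup rather than typed in by hand. The version below assumes GNU C (`__builtin_clz`), a 32-bit `unsigned`, and a 64-bit result so that the top entries (10^10) do not overflow. The names `start_pow10`, `init_start_pow10` and `uintcat_msb` are made up for illustration.

```c
#include <stdint.h>

/* Indexed by the MSB position of (ls | 1); filled in by init_start_pow10(). */
static uint64_t start_pow10[32];

/* Entry i is the smallest power of 10 strictly greater than 2^i, i.e. the
   right multiplier for the low end of the range 2^i .. 2^(i+1)-1. */
static void init_start_pow10(void)
{
    uint64_t mult = 10;
    for (unsigned i = 0; i < 32; i++) {
        while (mult <= (1ULL << i))
            mult *= 10;
        start_pow10[i] = mult;
    }
}

/* Requires init_start_pow10() to have been called once beforehand. */
static uint64_t uintcat_msb(uint32_t ms, uint32_t ls)
{
    /* GNU C builtin; the |1 avoids the undefined clz(0) case. */
    unsigned msb = 31u - (unsigned)__builtin_clz(ls | 1u);
    uint64_t mult = start_pow10[msb];

    if (mult <= ls)           /* this range straddles a power of 10 */
        mult *= 10;           /* at most one extra step, so no loop */

    return (uint64_t)ms * mult + ls;
}
```

For example, 99 and 100 both have their MSB at position 6 and both start from 100. Only 100 takes the extra multiplication by 10.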

I probably wouldn't have written the part with _lzcnt_u32 or ARM __clz if I'd learned of the clz(a|1) trick for avoiding problems with input=0 beforehand. But I did, and played around with the source a bit to try to get nicer asm from gcc and clang. Indexing the table on clz or BSR depending on the target platform makes it a bit of a mess.

#include <stdint.h>
#include <limits.h>
#include <assert.h>

   // builtin_clz matches Intel's docs for x86 BSR: garbage result for input=0
   // actual x86 HW leaves the destination register unmodified; AMD even documents this.
   // but GNU C doesn't let us take advantage with intrinsics.
   // unless you use BMI1 _lzcnt_u32


// if available, use an intrinsic that gives us a leading-zero count
// *without* an undefined result for input=0
#ifdef __LZCNT__      // x86 CPU feature
#include <immintrin.h>  // Intel's intrinsics
#define HAVE_LZCNT32
#define lzcnt32(a) _lzcnt_u32(a)
#endif

#ifdef __ARM__      // TODO: do older ARMs not have this?
#define HAVE_LZCNT32
#define lzcnt32(a) __clz(a)  // builtin, no header needed
#endif

// Some POWER compilers define `__cntlzw`?



// index = msb position, or lzcnt, depending on which the HW can do more efficiently
// defined later; one or the other is unused and optimized out, depending on target platform
// alternative: fill this at run-time startup
// with a loop that does mult*=10 when (x<<1)-1 > mult, or something
//#if INDEX_BY_MSB_POS == 1
  __attribute__((unused))
  static const uint32_t catpower_msb[] = {
       10,    // 1 and 0
       10,    // 2..3
       10,    // 4..7
       10,    // 8..15
       100,    // 16..31     // 2 digits even for the low end of the range
       100,    // 32..63
       100,    // 64..127
       1000,   // 128..255   // 3 digits
       1000,   // 256..511
       1000,   // 512..1023
       10000,   // 1024..2047
       10000,   // 2048..4095
       10000,   // 4096..8191
       10000,   // 8192..16383
       100000,   // 16384..32767
       100000,   // 32768..65535      // up to 2^16-1, enough for 16-bit inputs
       //  ...   // fill in the rest yourself
  };
//#elif INDEX_BY_MSB_POS == 0
  // index on leading zeros
  __attribute__((unused))
  static const uint32_t catpower_lz32[] = {
      // top entries overflow: 10^10 doesn't fit in uint32_t
      // intentionally wrong to make it easier to spot bad output.
    4000000000,    // 2^31 .. 2^32-1    2*10^9 .. 4*10^9
    2000000000,    // 1,073,741,824 .. 2,147,483,647
    // first correct entry
    1000000000,    //   536,870,912 .. 1,073,741,823

    // ... fill in the rest
    // for testing, skip until 16 leading zeros
    [16] = 100000,   // 32768..65535      // up to 2^16-1, enough for 16-bit inputs
       100000,   // 16384..32767
       10000,   // 8192..16383
       10000,   // 4096..8191
       10000,   // 2048..4095
       10000,   // 1024..2047
       1000,   // 512..1023
       1000,   // 256..511
       1000,   // 128..255
       100,    // 64..127
       100,    // 32..63
       100,    // 16..31     // low end of the range has 2 digits
       10,    // 8..15
       10,    // 4..7
       10,    // 2..3
       10,    // 1
                       // lzcnt32(0) == 32
       10,    // 0     // treat 0 as having one significant digit.
  };
//#else
//#error "INDEX_BY_MSB_POS not set correctly"
//#endif



//#undef HAVE_LZCNT32  // codegen for the other path, for fun

static inline uint32_t msb_power10(uint32_t a)
{
#ifdef HAVE_LZCNT32  // 0-safe lzcnt32 macro available
    #define INDEX_BY_MSB_POS 0
    // a |= 1 would let us shorten the table, in case 32*4 is a lot nicer than 33*4 bytes
    unsigned lzcnt = lzcnt32(a);  // 32 for a=0
    return catpower_lz32[lzcnt];
#else
  // only generic __builtin_clz available

  static_assert(sizeof(uint32_t) == sizeof(unsigned) && UINT_MAX == (1ULL<<32)-1, "__builtin_clz isn't 32-bit");
  // See also https://foonathan.net/blog/2016/02/11/implementation-challenge-2.html
  // for C++ templates for fixed-width wrappers for __builtin_clz

  #if defined(__i386__) || defined(__x86_64__)
    // x86 where MSB_index = 31-clz = BSR is most efficient
    #define INDEX_BY_MSB_POS 1
    unsigned msb = 31 - __builtin_clz(a|1);  // BSR
    return catpower_msb[msb];
    //return unlikely(a==0) ? 10 : catpower_msb[msb];
  #else
    // use clz directly while still avoiding input=0
    // I think all non-x86 CPUs with hardware CLZ do define clz(0) = 32 or 64 (the operand width),
    // but gcc's builtin is still documented as not valid for input=0
    // Most ISAs like PowerPC and ARM that have a bitscan instruction have clz, not MSB-index

    // set the LSB to avoid the a==0 special case
    unsigned clz = __builtin_clz(a|1);
    // table[32] unused, could add yet another #ifdef for that
    #define INDEX_BY_MSB_POS 0
    //return unlikely(a==0) ? 10 : catpower_lz32[clz];
    return catpower_lz32[clz];   // a|1 avoids the special-casing
  #endif  // optimize for BSR or not
#endif // HAVE_LZCNT32
}


uint32_t uintcat (uint32_t ms, uint32_t ls)
{
//  if (ls==0) return ms * 10;  // Another way to avoid the special case for clz

  uint32_t mult = msb_power10(ls); // catpower[clz(ls)];
  uint32_t high = mult * ms;
#if 0
  if (mult <= ls)
      high *= 10;
  return high + ls;
#else
  // hopefully compute both and then select
  // because some CPUs can shift and add at the same time (x86, ARM)
  // so this avoids having an ADD *after* the cmov / csel, if the compiler is smart
  uint32_t another10 = high*10 + ls;
  uint32_t enough = high + ls; 
  return (mult<=ls) ? another10 : enough;
#endif
}

From the Godbolt compiler explorer, this compiles efficiently for x86-64 with and without BSR:

# clang7.0 -O3 for x86-64 SysV,  -march=skylake -mno-lzcnt
uintcat(unsigned int, unsigned int):
    mov     eax, esi
    or      eax, 1
    bsr     eax, eax                    # 31-clz(ls|1)
    mov     ecx, dword ptr [4*rax + catpower_msb]
    imul    edi, ecx                    # high = mult * ms
    lea     eax, [rdi + rdi]
    lea     eax, [rax + 4*rax]          # retval = high * 10
    cmp     ecx, esi
    cmova   eax, edi                    # if(mult>ls) retval = high   (drop the *10 result)
    add     eax, esi                    # retval += ls
    ret

Or with lzcnt (enabled by -march=haswell or later, or some AMD uarches),

uintcat(unsigned int, unsigned int):
          # clang doesn't try to break the false dependency on EAX; gcc uses xor eax,eax
    lzcnt   eax, esi                    # C source avoids the |1, saving instructions
    mov     ecx, dword ptr [4*rax + catpower_lz32]
    imul    edi, ecx                    # same as above from here on
    lea     eax, [rdi + rdi]
    lea     eax, [rax + 4*rax]
    cmp     ecx, esi
    cmova   eax, edi
    add     eax, esi
    ret

Clang factors the final add out of both sides of the ternary, which is a missed optimization: it adds 1 cycle of latency after the cmov. We can multiply by 10 and add just as cheaply as multiplying by 10 alone, on Intel CPUs:

    ... same start         # hand-optimized version that clang should use
    imul    edi, ecx                    # high = mult * ms
    lea     eax, [rdi + 4*rdi]          # high * 5
    lea     eax, [rsi + rax*2]          # retval = high * 10 + ls   (uses the high*5 result in eax)
    add     edi, esi                    # tmp = high + ls
    cmp     ecx, esi
    cmova   eax, edi                    # if(mult>ls) retval = high+ls
    ret

So the high + ls latency would run in parallel with the high*10 + ls latency, both needed as inputs for cmov.

GCC branches instead of using CMOV for the last condition. GCC also makes a mess of 31-clz(a|1), calculating clz with BSR and XOR with 31, but then subtracting that from 31. And it has some extra mov instructions. Strangely, gcc seems to do better with that BSR code when lzcnt is available, even though it chooses not to use it.

clang has no trouble optimizing away the 31-clz double-inversion and just using BSR directly.

For PowerPC64, clang also makes branchless asm. gcc does something similar, but with a branch, like on x86-64.

uintcat:
.Lfunc_gep0:
    addis 2, 12, .TOC.-.Lfunc_gep0@ha
    addi 2, 2, .TOC.-.Lfunc_gep0@l
    ori 6, 4, 1                              # OR immediate
    addis 5, 2, catpower_lz32@toc@ha
    cntlzw 6, 6                              # CLZ  word
    addi 5, 5, catpower_lz32@toc@l           # static table address
    rldic 6, 6, 2, 30                        # rotate left and clear immediate (shift and zero-extend the CLZ result)
    lwzx 5, 5, 6                             # Load Word Zero eXtend, catpower_lz32[clz]
    mullw 3, 5, 3                            # mul word
    cmplw   5, 4                             # compare   mult, ls
    mulli 6, 3, 10                           # mul immediate
    isel 3, 3, 6, 1                          # conditional select high vs. high*10
    add 3, 3, 4                              # + ls
    clrldi  3, 3, 32                         # zero extend, clearing upper 32 bits
    blr                                      # return

Compressing the table

Using clz(ls|1) >> 1, or that +1, should work, because 4 < 10. The table always takes at least 3 entries to gain another digit. I haven't investigated this. (And have already spent longer than I meant to on this. :P)
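As an untested-on-hardware sketch of what such a halved table might look like, the version below indexes by the MSB position shifted right by one, which is equivalent in spirit to clz(ls|1) >> 1. Each bucket then spans a factor of 4, so it can cross at most one power of 10 and a single compare still suffices. It assumes GNU C, a 32-bit `unsigned`, and a 64-bit result; `half_table`, `init_half_table` and `uintcat_half` are hypothetical names.

```c
#include <stdint.h>

/* 16 entries instead of 32: bucket k covers values 4^k .. 4^(k+1)-1. */
static uint64_t half_table[16];

/* Entry k is the smallest power of 10 strictly greater than 4^k. */
static void init_half_table(void)
{
    uint64_t mult = 10;
    for (unsigned k = 0; k < 16; k++) {
        while (mult <= (1ULL << (2 * k)))   /* low end of bucket k is 4^k */
            mult *= 10;
        half_table[k] = mult;
    }
}

/* Requires init_half_table() to have been called once beforehand. */
static uint64_t uintcat_half(uint32_t ms, uint32_t ls)
{
    unsigned msb = 31u - (unsigned)__builtin_clz(ls | 1u);  /* GNU C builtin */
    uint64_t mult = half_table[msb >> 1];

    if (mult <= ls)       /* the bucket's range is narrower than 10x, so one step is enough */
        mult *= 10;

    return (uint64_t)ms * mult + ls;
}
```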

Or right-shift a lot more to just get a starting point for the loop, e.g. mult = clz(ls) >= 18 ? 10 : 100000; (values below 16384 start from 10, larger ones from 100000), or a 3 or 4-way chain of if.


Or loop on mult *= 100, and after exiting that loop sort out whether you want old_mult * 10 or mult (i.e. check if you went too far). This cuts the iteration count in half for even numbers of digits.

(Watch out for a possible infinite loop on large ls that would overflow the result. If mult *= 100 wraps to 0, it will always stay <= ls for ls = 1000000000, for example.)
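A minimal sketch of the mult *= 100 loop follows. It sidesteps the wraparound hazard by keeping the multiplier in 64 bits, which is an assumption of this sketch rather than something the answer specifies. `uintcat_by100` is an illustrative name.

```c
#include <stdint.h>

/* Steps two decimal digits at a time, then backs off by one digit if possible.
   mult never exceeds 10^10 for a 32-bit ls, so it cannot wrap in uint64_t. */
static uint64_t uintcat_by100(uint32_t ms, uint32_t ls)
{
    uint64_t old_mult;
    uint64_t mult = 1;

    do {
        old_mult = mult;
        mult *= 100;                 /* two decimal digits per iteration */
    } while (mult <= ls);

    /* mult > ls holds here; check whether we went one digit too far. */
    if (old_mult * 10 > ls)
        mult = old_mult * 10;

    return (uint64_t)ms * mult + ls;
}
```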
