Is there a more efficient way to get the length of a 32-bit integer in bytes?

I'd like a shortcut for the following little function, where performance is very important (the function is called more than 10,000,000 times):

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

Does anyone have any idea... a cool bit-operation trick? Thanks for your help in advance!

How about this one?

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

With the inline keyword removed, g++ -O2 compiles this to the following branchless code:

movl    8(%ebp), %edx
movl    %edx, %eax
andl    $-16777216, %eax
cmpl    $1, %eax
sbbl    %eax, %eax
addl    $4, %eax
xorl    %ecx, %ecx
testl   $-65536, %edx
sete    %cl
subl    %ecx, %eax
andl    $-256, %edx
sete    %dl
movzbl  %dl, %edx
subl    %edx, %eax

If you don't mind machine-specific solutions, you can use the x86 bsr instruction, which finds the index of the most significant 1 bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:

int len(uint32 val)
{
    asm("mov 8(%ebp), %eax");
    asm("or  $255, %eax");
    asm("bsr %eax, %eax");
    asm("shr $3, %eax");
    asm("inc %eax");
    asm("mov %eax, 8(%ebp)");
    return val;
}

Note that I am not an inline assembly god, so maybe there's a better solution for accessing val than addressing the stack explicitly. But you should get the basic idea.

The GNU compiler also has an interesting built-in function called __builtin_clz:

inline int len(uint32 val)
{
    return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}

This looks much better than the inline assembly version to me :)
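To spell out why that expression works: OR-ing with 255 guarantees a nonzero argument (the result of __builtin_clz(0) is undefined), XOR with 31 turns the leading-zero count into the index of the highest set bit, and shifting right by 3 turns that bit index into a byte index. The sketch below is only an illustration of that reasoning. The names highest_bit_index and len_clz are hypothetical, the loop fallback for other compilers is my own addition, and it assumes unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0..31) of the most significant set bit; v must be nonzero. */
static int highest_bit_index(uint32_t v)
{
#if defined(__GNUC__) || defined(__clang__)
    /* 31 - clz is the same as clz ^ 31 because clz lies in 0..31.
       Assumes unsigned int is 32 bits wide. */
    return 31 - __builtin_clz(v);
#else
    /* Portable fallback (an assumption, not part of the answer above). */
    int index = 0;
    while (v >>= 1)
        ++index;
    return index;
#endif
}

/* Bytes needed for val: byte index of the top set bit, plus one. */
static int len_clz(uint32_t val)
{
    return (highest_bit_index(val | 255u) >> 3) + 1;
}

/* Checks both sides of every byte boundary. */
static void len_clz_self_check(void)
{
    assert(len_clz(0x00000000u) == 1);
    assert(len_clz(0x000000ffu) == 1);
    assert(len_clz(0x00000100u) == 2);
    assert(len_clz(0x0000ffffu) == 2);
    assert(len_clz(0x00010000u) == 3);
    assert(len_clz(0x00ffffffu) == 3);
    assert(len_clz(0x01000000u) == 4);
    assert(len_clz(0xffffffffu) == 4);
}
```

Calling len_clz_self_check() once at startup is a cheap way to confirm that the builtin and the fallback agree on the boundaries.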

I ran a mini, unscientific benchmark that just measures the difference between GetTickCount() calls around a loop that calls the function for every value from 0 to MAX_LONG, using the VS 2010 compiler.

Here's what I saw:

This took 11497 ticks:

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

While this took 14399 ticks:

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

Edit: my idea about why one was faster is wrong, because:

inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}

This version used only 11107 ticks. Perhaps because + is faster than -? I'm not sure.

Even faster, though, was the binary search, at 7161 ticks:

inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}

And the fastest so far uses the MS intrinsic function, at 4399 ticks:

#pragma intrinsic(_BitScanReverse)

inline int len2(uint32 val)
{
    DWORD index;
    _BitScanReverse(&index, val);   // note: index is undefined when val == 0

    return (index>>3)+1;

}

For reference, here's the code I used to profile:

int _tmain(int argc, _TCHAR* argv[])
{
    int j = 0;
    DWORD t1,t2;

    t1 = GetTickCount();

    for(ULONG i=0; i<-1; i++)
        j=len(i);

    t2 = GetTickCount();

    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);


    t1 = GetTickCount();

    for(ULONG i=0; i<-1; i++)
        j=len2(i);

    t2 = GetTickCount();

    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);
}

I had to print j to prevent the loops from being optimized out.
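For readers without GetTickCount(), the same kind of measurement can be sketched in portable C with clock(). This is a hedged illustration rather than the harness used for the numbers above: the names len_compare, len_sum, checksum, time_one and compare_candidates are all hypothetical, and the checksum plays the role that printing j plays in the original.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef int (*len_fn)(uint32_t);

/* The question's original comparison chain. */
static int len_compare(uint32_t val)
{
    if (val <= 0x000000ffu) return 1;
    if (val <= 0x0000ffffu) return 2;
    if (val <= 0x00ffffffu) return 3;
    return 4;
}

/* The branch-free sum of comparisons. */
static int len_sum(uint32_t val)
{
    return 1 + (val > 0x000000ffu) + (val > 0x0000ffffu) + (val > 0x00ffffffu);
}

/* Calls f on 0..count-1; returning the sum keeps the calls observable. */
static uint64_t checksum(len_fn f, uint32_t count)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < count; ++i)
        total += (uint64_t)f(i);
    return total;
}

/* Times one candidate and prints CPU seconds plus the checksum. */
static uint64_t time_one(const char *name, len_fn f, uint32_t count)
{
    clock_t start = clock();
    uint64_t total = checksum(f, count);
    clock_t stop = clock();
    printf("%-8s %.3f s  checksum %llu\n", name,
           (double)(stop - start) / CLOCKS_PER_SEC,
           (unsigned long long)total);
    return total;
}

/* Times both candidates and insists that they agree. */
static void compare_candidates(uint32_t count)
{
    uint64_t a = time_one("compare", len_compare, count);
    uint64_t b = time_one("sum", len_sum, count);
    assert(a == b);
}
```

A call such as compare_candidates(10000000u) from main() would run both candidates. Keep in mind that sequential inputs are very kind to the branch predictor, so a loop like this tends to flatter the branching versions; inputs drawn from the real workload would give a fairer picture.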

Do you really have profile evidence that this is a significant bottleneck in your application? Just do it the most obvious way, and only if profiling shows it to be a problem (which I doubt) try to improve things. Most likely you'll get a bigger improvement by reducing the number of calls to this function than by changing something within it.

Binary search MIGHT save a few cycles, depending on the processor architecture.

inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}

Or, finding out which case is the most common might bring down the average number of cycles, for example if most inputs are one byte (e.g. when building UTF-8 encodings, although then your break points wouldn't be 32/24/16/8):

inline int len(uint32 val)
{
    if (val & 0xffffff00) {
       if (val & 0xffff0000) {
           if (val & 0xff000000) return 4;
           return 3;
       }
       return 2;
    }
    return 1;
}

Now the short case does the fewest conditional tests.
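To make the UTF-8 aside concrete: the encoded length of a code point changes at 0x7F, 0x7FF and 0xFFFF rather than at the byte boundaries, but the same "test the common case first" ordering applies. The following is a minimal sketch under the assumption that ASCII dominates the input; utf8_len is a hypothetical name, and the function does not reject surrogate code points.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for code point cp,
   or 0 if cp lies above U+10FFFF. */
static int utf8_len(uint32_t cp)
{
    if (cp <= 0x7Fu)     return 1;  /* ASCII, assumed to be the common case */
    if (cp <= 0x7FFu)    return 2;
    if (cp <= 0xFFFFu)   return 3;
    if (cp <= 0x10FFFFu) return 4;
    return 0;                       /* not a valid code point */
}

/* Checks both sides of every break point. */
static void utf8_len_self_check(void)
{
    assert(utf8_len(0x00u) == 1);
    assert(utf8_len(0x7Fu) == 1);
    assert(utf8_len(0x80u) == 2);
    assert(utf8_len(0x7FFu) == 2);
    assert(utf8_len(0x800u) == 3);
    assert(utf8_len(0xFFFFu) == 3);
    assert(utf8_len(0x10000u) == 4);
    assert(utf8_len(0x10FFFFu) == 4);
    assert(utf8_len(0x110000u) == 0);
}
```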

If bit ops are faster than comparisons on your target machine, you can do this:

inline int len(uint32 val)
{
    if(val & 0xff000000) return 4;
    if(val & 0x00ff0000) return 3;
    if(val & 0x0000ff00) return 2;
    return 1;
} 

You can avoid the conditional branches, which can be costly if the distribution of your numbers does not make prediction easy:

return 4 - (val <= 0x000000ff) - (val <= 0x0000ffff) - (val <= 0x00ffffff);

Changing the <= to a & will not change anything much on a modern processor. What is your target platform?

Here is the generated code for x86-64 with gcc -O:

    cmpl    $255, %edi
    setg    %al
    movzbl  %al, %eax
    addl    $3, %eax
    cmpl    $65535, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax
    cmpl    $16777215, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax

There are cmpl comparison instructions of course, but they are followed by setg or setle instead of the usual conditional branches. It's the conditional branch that is expensive on a modern pipelined processor, not the comparison. So this version avoids the expensive conditional branches.

My attempt at hand-optimizing gcc's assembly:

    cmpl    $255, %edi
    setg    %al
    addb    $3, %al
    cmpl    $65535, %edi
    setle   %dl
    subb    %dl, %al
    cmpl    $16777215, %edi
    setle   %dl
    subb    %dl, %al
    movzbl  %al, %eax

On some architectures this could be quicker:

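#include <math.h>

inline int len(uint32_t val) {
   if (val == 0) return 1;                   // log(0) is undefined
   return (int)( log(val) / log(256) ) + 1;  // floor of log base 256 of val, plus 1;
                                             // rounding may be off at exact powers of 256
}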

This may also be slightly faster (if a comparison takes longer than a bitwise AND):

inline int len(uint32_t val) {
    if (val & ~0x00FFffFF) {
        return 4;
    }
    if (val & ~0x0000ffFF) {
        return 3;
    }
    if (val & ~0x000000FF) {
        return 2;
    }
    return 1;
}

If you are on an 8-bit microcontroller (like an 8051 or AVR), then this will work best:

inline int len(uint32_t val) {
    union int_char { 
          uint32_t u;
          uint8_t a[4];
    } x;
    x.u = val; // doing it this way rather than taking the address of val often prevents
               // the compiler from doing dumb things.
    if (x.a[0]) {
        return 4;
    } else if (x.a[1]) {
       return 3;
    ...

EDIT by tristopia: endianness-aware version of the last variant

int len(uint32_t val)
{
  union int_char {
        uint32_t u;
        uint8_t a[4];
  } x;
  const uint16_t w = 1;

  x.u = val;
  if( ((uint8_t *)&w)[1]) {   // BIG ENDIAN (Sparc, m68k, ARM, Power)
     if(x.a[0]) return 4;
     if(x.a[1]) return 3;
     if(x.a[2]) return 2;
  }
  else {                      // LITTLE ENDIAN (x86, 8051, ARM)
    if(x.a[3]) return 4;
    if(x.a[2]) return 3;
    if(x.a[1]) return 2;
  }
  return 1;
}

Because of the const, any compiler worth its salt will only generate the code for the right endianness.
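An alternative that sidesteps the endianness test altogether is to pick out the bytes with shifts, since shifts work on the value rather than on its layout in memory. This is only a sketch: len_shift is a hypothetical name, and whether a given 8-bit compiler reduces each shift to a single byte access is an assumption that should be checked in the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Tests the bytes from most to least significant. The shifts are defined
   on the value, so the result is the same on any byte order. */
static int len_shift(uint32_t val)
{
    if ((uint8_t)(val >> 24)) return 4;
    if ((uint8_t)(val >> 16)) return 3;
    if ((uint8_t)(val >> 8))  return 2;
    return 1;
}

/* Checks both sides of every byte boundary. */
static void len_shift_self_check(void)
{
    assert(len_shift(0x00000000u) == 1);
    assert(len_shift(0x000000ffu) == 1);
    assert(len_shift(0x00000100u) == 2);
    assert(len_shift(0x0000ffffu) == 2);
    assert(len_shift(0x00010000u) == 3);
    assert(len_shift(0x00ffffffu) == 3);
    assert(len_shift(0x01000000u) == 4);
    assert(len_shift(0xffffffffu) == 4);
}
```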

You may have a more efficient solution depending on your architecture.

MIPS has a CLZ instruction that counts the number of leading zero bits of a number. What you are looking for here is essentially 4 - (CLZ(x) / 8) (where / is integer division). PowerPC has the equivalent instruction cntlz, and x86 has BSR, which returns the index of the highest set bit rather than the count of leading zeros. This solution should simplify down to 3-4 instructions (not counting function call overhead) and zero branches.
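The formula can be tried out without any particular instruction by using a software count-leading-zeros routine as a stand-in for the hardware one. In this sketch clz32 and len_from_clz are hypothetical names. Note one assumption I added: the argument is OR-ed with 1 so that an input of 0 yields 1 byte, as in the question's function, whereas the bare formula would yield 0 for CLZ(0) = 32.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for a CLZ instruction: counts leading zero bits
   by binary search. Returns 32 for x == 0. */
static int clz32(uint32_t x)
{
    int n = 0;
    if (x == 0) return 32;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

/* 4 - CLZ(x) / 8, with x | 1 so that 0 maps to a length of 1. */
static int len_from_clz(uint32_t x)
{
    return 4 - clz32(x | 1u) / 8;
}

/* Checks the stand-in and both sides of every byte boundary. */
static void len_from_clz_self_check(void)
{
    assert(clz32(1u) == 31);
    assert(clz32(0x80000000u) == 0);
    assert(len_from_clz(0x00000000u) == 1);
    assert(len_from_clz(0x000000ffu) == 1);
    assert(len_from_clz(0x00000100u) == 2);
    assert(len_from_clz(0x0000ffffu) == 2);
    assert(len_from_clz(0x00010000u) == 3);
    assert(len_from_clz(0x00ffffffu) == 3);
    assert(len_from_clz(0x01000000u) == 4);
    assert(len_from_clz(0xffffffffu) == 4);
}
```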

Just to illustrate a common pitfall regarding branches on x86, based on FredOverflow's answer (which is nice work, kudos and +1). Here's FredOverflow's assembly as output by gcc:

movl    8(%ebp), %edx   #1/.5
movl    %edx, %eax      #1/.5
andl    $-16777216, %eax#1/.5
cmpl    $1, %eax        #1/.5
sbbl    %eax, %eax      #8/6
addl    $4, %eax        #1/.5
xorl    %ecx, %ecx      #1/.5
testl   $-65536, %edx   #1/.5
sete    %cl             #5
subl    %ecx, %eax      #1/.5
andl    $-256, %edx     #1/.5
sete    %dl             #5
movzbl  %dl, %edx       #1/.5
subl    %edx, %eax      #1/.5
# sum total: 29/21.5 cycles

(the latency, in cycles, is to be read as Prescott/Northwood)

Pascal Cuoq's hand-optimized assembly (also kudos):

cmpl    $255, %edi      #1/.5
setg    %al             #5
addb    $3, %al         #1/.5
cmpl    $65535, %edi    #1/.5
setle   %dl             #5
subb    %dl, %al        #1/.5
cmpl    $16777215, %edi #1/.5
setle   %dl             #5
subb    %dl, %al        #1/.5
movzbl  %al, %eax       #1/.5
# sum total: 22/18.5 cycles

Edit: FredOverflow's solution using __builtin_clz():

movl 8(%ebp), %eax   #1/.5
popl %ebp            #1.5
orb  $-1, %al        #1/.5
bsrl %eax, %eax      #16/8
sarl $3, %eax        #1/4
addl $1, %eax        #1/.5
ret
# sum total: 20/13.5 cycles

And the gcc assembly for your code:

movl $1, %eax        #1/.5
movl %esp, %ebp      #1/.5
movl 8(%ebp), %edx   #1/.5
cmpl $255, %edx      #1/.5
jbe  .L3             #up to 9 cycles
cmpl $65535, %edx    #1/.5
movb $2, %al         #1/.5
jbe  .L3             #up to 9 cycles
cmpl $16777216, %edx #1/.5
sbbl %eax, %eax      #8/6
addl $4, %eax        #1/.5
.L3:
ret
# sum total: 16/10 cycles - 34/28 cycles

Here, the instruction cache line fetches that come as a side effect of the jcc instructions probably cost nothing for such a short function.

Branches can be a reasonable choice, depending on the input distribution.

Edit: added FredOverflow's solution that uses __builtin_clz().

OK, one more version. Similar to Fred's, but with fewer operations.

inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
    ;
}

This gives you fewer comparisons, but it may be less efficient if a memory access costs more than a couple of comparisons.

int precalc[1<<16];
int precalchigh[1<<16];
void doprecalc()
{
    for(int i = 0; i < 1<<16; i++) {
        precalc[i] = (i < (1<<8) ? 1 : 2);
        precalchigh[i] = precalc[i] + 2;
    }
}
inline int len(uint32 val)
{
    return (val & 0xffff0000 ? precalchigh[val >> 16] : precalc[val]);
}

The minimum number of bits required to store an integer n (for n >= 1; floating-point rounding may still misclassify values at exact powers of two) is:

int minbits = (int)floor( log10(n) / log10(2) ) + 1 ;

The number of bytes is:

int minbytes = (int)floor( log10(n) / log10(2) / 8 ) + 1 ;

This is an entirely FPU-bound solution; performance may or may not be better than a conditional test, but it is perhaps worth investigating.

[EDIT] I did the investigation: a simple loop of ten million iterations of the above took 918 ms, whereas FredOverflow's accepted solution took just 49 ms (VC++ 2010). So this is not an improvement in terms of performance, though it may remain useful if the number of bits is what is required, and further optimisations are possible.

To Pascal Cuoq and the 35 other people who up-voted his comment:

"Wow! More than 10 million times... You mean that if you squeeze three cycles out of this function, you will save as much as 0.03s?"

Such a sarcastic comment is at best rude and offensive.

Optimization is frequently the cumulative result of 3% here, 2% there. 3% in overall capacity is nothing to be sneezed at. Suppose this was an almost saturated and unparallelizable stage in a pipe. Suppose CPU utilization went from 99% to 96%. Simple queuing theory tells one that such a reduction in CPU utilization would reduce the average queue length by over 75% [qualitatively, load divided by (1 - load)].
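The arithmetic behind that figure can be worked through as follows, assuming the simplest single-server (M/M/1) model, in which the mean number of items in the system is the load divided by one minus the load. The post only says "simple queuing theory", so the choice of model is an assumption.

```latex
L(\rho) = \frac{\rho}{1-\rho}, \qquad
L(0.99) = \frac{0.99}{0.01} = 99, \qquad
L(0.96) = \frac{0.96}{0.04} = 24, \qquad
\frac{99 - 24}{99} \approx 0.758
```

So under that model the average drops from 99 to 24 items, a reduction of roughly 76%.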

Such a reduction can frequently make or break a particular hardware configuration, as it has feedback effects on memory requirements, caching of the queued items, lock convoying, and (horror of horrors, should it be a paged system) even paging. It is precisely these sorts of effects that cause bifurcated hysteresis-loop-type system behavior.

Arrival rates of anything seem to tend to go up, and field replacement of a particular CPU or buying a faster box is frequently just not an option.

Optimization is not just about wall clock time on a desktop. Anyone who thinks that it is has much reading to do about the measurement and modelling of computer program behavior.

Pascal Cuoq owes the original poster an apology.

If I remember 80x86 asm right, I'd do something like:

; Assume value in EAX; count goes into ECX
  cmp eax,16777216 ; Carry set if less (value fits in 3 bytes)
  sbb ecx,ecx      ; Load -1 if less, 0 if not
  cmp eax,65536    ; Carry set if less (value fits in 2 bytes)
  sbb ecx,0        ; Subtract 1 if less; 0 if not
  cmp eax,256      ; Carry set if less (value fits in 1 byte)
  sbb ecx,-4       ; Add 3 if less, 4 if not

Six instructions. I think the same approach would also work in six instructions on the ARM I use.

Note: technical posts on this site are licensed under CC BY-SA 4.0; if you reproduce this content, please credit the site or the original source.
