What is the fastest integer division supporting division by zero no matter what the result is?

Summary:

I'm looking for the fastest way to calculate

(int) x / (int) y

without getting an exception for y==0. Instead, I just want an arbitrary result.


Background:

When coding image processing algorithms I often need to divide by an (accumulated) alpha value. The simplest variant is plain C code with integer arithmetic. My problem is that I typically get a division-by-zero error for result pixels with alpha==0. However, these are exactly the pixels where the result doesn't matter at all: I don't care about the color values of pixels with alpha==0.


Details:

I'm looking for something like:

result = (y==0)? 0 : x/y;

or

result = x / MAX( y, 1 );

x and y are non-negative integers. The code is executed a huge number of times in a nested loop, so I'm looking for a way to get rid of the conditional branching.
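For context, here is a minimal sketch of the kind of inner loop described above, using the clamp-to-1 variant. The names `max_int`, `normalize_pixels`, `color_sum` and `alpha_sum` are made up for illustration and do not come from the question.

```c
#include <stddef.h>

/* Hypothetical helper standing in for the MAX macro: returns the larger value. */
static inline int max_int(int a, int b) { return a > b ? a : b; }

/* Divide each accumulated color by its accumulated alpha.
   Pixels with alpha == 0 are divided by 1, so they get an arbitrary
   (here: undivided) value instead of raising a division error. */
void normalize_pixels(const int *color_sum, const int *alpha_sum,
                      int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = color_sum[i] / max_int(alpha_sum[i], 1);
}
```

Whether the clamp compiles to a branch or to a conditional move depends on the compiler and the optimization level, so it is worth checking the generated assembly.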

When y does not exceed the byte range, I'm happy with the solution

unsigned char kill_zero_table[256] = { 1, 1, 2, 3, 4, 5, 6, 7, [...] 255 };
[...]
result = x / kill_zero_table[y];

But this obviously does not work well for bigger ranges.
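As a sketch of the table approach, the table can be filled once at startup instead of being written out by hand. `init_kill_zero_table` and `div_by_byte` are hypothetical helper names, not part of the original code.

```c
#include <stddef.h>

static unsigned char kill_zero_table[256];

/* Fill the table so that entry 0 maps to 1 and every other entry maps to itself. */
void init_kill_zero_table(void) {
    kill_zero_table[0] = 1;
    for (size_t i = 1; i < 256; i++)
        kill_zero_table[i] = (unsigned char)i;
}

/* Divide by a byte-sized divisor; a zero divisor is silently treated as 1.
   init_kill_zero_table() must have been called first. */
int div_by_byte(int x, unsigned char y) {
    return x / kill_zero_table[y];
}
```

The lookup trades the branch for a memory access, which is cheap for 256 entries that stay in cache but scales poorly, as noted above.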

I guess the final question is: What's the fastest bit-twiddling hack that changes 0 to any other integer value while leaving all other values unchanged?
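One possible comparison-free answer to that question, sketched for 32-bit unsigned values only. The function name `zero_to_one` is made up, and whether this beats a plain `y == 0` test is something only a benchmark on the target compiler can tell.

```c
#include <stdint.h>

/* Maps 0 to 1 and leaves every other value unchanged, using only
   arithmetic and bit operations (no comparison, no branch).
   Assumes a 32-bit unsigned operand. */
static inline uint32_t zero_to_one(uint32_t y) {
    /* (y | -y) has its top bit set exactly when y != 0. */
    uint32_t is_nonzero = (y | (0u - y)) >> 31;
    /* Add 1 only when y was 0. */
    return y + (is_nonzero ^ 1u);
}
```

It would be used as `result = x / zero_to_one(y);`.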


Clarifications

I'm not 100% sure that branching is too expensive. However, different compilers are used, so I prefer benchmarking with little optimization (which is indeed questionable).

For sure, compilers are great when it comes to bit twiddling, but I can't express the "don't care" result in C, so the compiler will never be able to use the full range of optimizations.

The code should be fully C compatible; the main platforms are 64-bit Linux with gcc and clang, and macOS.

Inspired by some of the comments, I got rid of the branch on my Pentium with the gcc compiler using

int f (int x, int y)
{
        y += y == 0;
        return x/y;
}

The compiler basically recognizes that it can use a condition flag of the test in the addition.

As per request, the assembly:

.globl f
    .type   f, @function
f:
    pushl   %ebp
    xorl    %eax, %eax
    movl    %esp, %ebp
    movl    12(%ebp), %edx
    testl   %edx, %edx
    sete    %al
    addl    %edx, %eax
    movl    8(%ebp), %edx
    movl    %eax, %ecx
    popl    %ebp
    movl    %edx, %eax
    sarl    $31, %edx
    idivl   %ecx
    ret

As this turned out to be such a popular question and answer, I'll elaborate a bit more. The above example is based on a programming idiom that a compiler recognizes. In the above case, a boolean expression is used in integer arithmetic, and condition flags were invented in hardware for this purpose. In general, condition flags are only accessible in C through such idioms. That is why it is so hard to make a portable multiple-precision integer library in C without resorting to (inline) assembly. My guess is that most decent compilers will understand the above idiom.
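To illustrate the multiple-precision point, here is a hedged sketch of the usual carry idiom in portable C. The type `u128` and the function `add_u128` are hypothetical names chosen for this example. A compiler may or may not turn the comparison into an add-with-carry instruction, and C gives no way to ask for one directly.

```c
#include <stdint.h>

/* Hypothetical two-limb unsigned integer: value = hi * 2^64 + lo. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

/* Adds two two-limb numbers. The carry out of the low limb is recovered
   with a comparison, because C cannot read the hardware carry flag. */
u128 add_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = r.lo < a.lo;   /* 1 if the low limb wrapped around */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```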

Another way of avoiding branches, as also remarked in some of the above comments, is predicated execution. I therefore took philipp's first code and my code and ran them through the compiler from ARM and through GCC for the ARM architecture, which features predicated execution. Both compilers avoid the branch in both code samples:

Philipp's version with the ARM compiler:

f PROC
        CMP      r1,#0
        BNE      __aeabi_idivmod
        MOVEQ    r0,#0
        BX       lr

Philipp's version with GCC:

f:
        subs    r3, r1, #0
        str     lr, [sp, #-4]!
        moveq   r0, r3
        ldreq   pc, [sp], #4
        bl      __divsi3
        ldr     pc, [sp], #4

My code with the ARM compiler:

f PROC
        RSBS     r2,r1,#1
        MOVCC    r2,#0
        ADD      r1,r1,r2
        B        __aeabi_idivmod

My code with GCC:

f:
        str     lr, [sp, #-4]!
        cmp     r1, #0
        addeq   r1, r1, #1
        bl      __divsi3
        ldr     pc, [sp], #4

All versions still need a branch to the division routine, because this version of the ARM doesn't have hardware division, but the test for y == 0 is fully implemented through predicated execution.

Here are some concrete numbers, on Windows using GCC 4.7.2:

#include <stdio.h>
#include <stdlib.h>

int main()
{
  unsigned int result = 0;
  for (int n = -500000000; n != 500000000; n++)
  {
    int d = -1;
    for (int i = 0; i != ITERATIONS; i++)
      d &= rand();

#if CHECK == 0
    if (d == 0) result++;
#elif CHECK == 1
    result += n / d;
#elif CHECK == 2
    result += n / (d + !d);
#elif CHECK == 3
    result += d == 0 ? 0 : n / d;
#elif CHECK == 4
    result += d == 0 ? 1 : n / d;
#elif CHECK == 5
    if (d != 0) result += n / d;
#endif
  }
  printf("%u\n", result);
}

Note that I am intentionally not calling srand(), so that rand() always returns exactly the same results. Note also that -DCHECK=0 merely counts the zeroes, so that it is obvious how often they appear.

Now, compiling and timing it in various ways:

$ for it in 0 1 2 3 4 5; do for ch in 0 1 2 3 4 5; do gcc test.cc -o test -O -DITERATIONS=$it -DCHECK=$ch && { time=`time ./test`; echo "Iterations $it, check $ch: exit status $?, output $time"; }; done; done

shows output that can be summarised in a table:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5
-------------+-------------------------------------------------------------------
Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555
Check 1      | 0m0.612s | -        | -        | -         | -         | -
Check 2      | 0m0.612s | 0m6.527s | 0m9.718s | 0m13.464s | 0m18.422s | 0m22.871s
Check 3      | 0m0.616s | 0m5.601s | 0m8.954s | 0m13.211s | 0m19.579s | 0m25.389s
Check 4      | 0m0.611s | 0m5.570s | 0m9.030s | 0m13.544s | 0m19.393s | 0m25.081s
Check 5      | 0m0.612s | 0m5.627s | 0m9.322s | 0m14.218s | 0m19.576s | 0m25.443s

If zeroes are rare, the -DCHECK=2 version performs badly. As zeroes become more common, the -DCHECK=2 case starts performing significantly better. Among the other options, there really isn't much difference.

For -O3, though, it is a different story:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5
-------------+-------------------------------------------------------------------
Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555
Check 1      | 0m0.646s | -        | -        | -         | -         | -
Check 2      | 0m0.654s | 0m5.670s | 0m9.905s | 0m14.238s | 0m17.520s | 0m22.101s
Check 3      | 0m0.647s | 0m5.611s | 0m9.085s | 0m13.626s | 0m18.679s | 0m25.513s
Check 4      | 0m0.649s | 0m5.381s | 0m9.117s | 0m13.692s | 0m18.878s | 0m25.354s
Check 5      | 0m0.649s | 0m6.178s | 0m9.032s | 0m13.783s | 0m18.593s | 0m25.377s

There, check 2 has no drawback compared to the other checks, and it keeps its benefit as zeroes become more common.

You should really measure to see what happens with your compiler and your representative sample data, though.
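A small harness along the following lines could be a starting point for such a measurement on your own data. Everything here (`div_fn`, `div_add_flag`, `div_ternary`, `run_variant`) is a hypothetical sketch and not the benchmark used for the tables above. Calling through a function pointer can also prevent inlining and distort the numbers, so treat the results as rough.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the variants under test. */
typedef int (*div_fn)(int x, int y);

/* Variant corresponding to check 2: add 1 to the divisor when it is 0. */
int div_add_flag(int x, int y) { return x / (y + !y); }

/* Variant corresponding to check 3: return 0 for a zero divisor. */
int div_ternary(int x, int y) { return y == 0 ? 0 : x / y; }

/* Runs fn over n (x, y) pairs, stores the elapsed CPU time in *seconds,
   and returns a checksum so the compiler cannot discard the divisions. */
unsigned run_variant(div_fn fn, const int *x, const int *y, size_t n,
                     double *seconds) {
    unsigned sum = 0;
    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        sum += (unsigned)fn(x[i], y[i]);
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return sum;
}
```

Feed it arrays captured from a real run, so that the proportion of zero divisors matches what your code actually sees.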

Without knowing the platform there is no way to know the exact most efficient method; however, on a generic system this may be close to the optimum (using Intel assembler syntax):

(assume the divisor is in ecx and the dividend is in eax)

mov ebx, ecx
sub ebx, 1      ; CF = 1 only if the divisor was 0
sbb ebx, ebx    ; ebx = -1 if the divisor was 0, else 0
sub ecx, ebx    ; divisor becomes 1 if it was 0, otherwise unchanged
xor edx, edx    ; div divides edx:eax, so clear the high half
div ecx

Four unbranched, single-cycle instructions turn a zero divisor into 1, then edx is cleared and the divide runs. The quotient will be in eax and the remainder in edx at the end. (This kind of shows why you don't want to send a compiler to do a man's job.)
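The same idea, a mask derived from the zero test that is folded into the divisor, can be written in C roughly as follows. This is only a sketch for unsigned 32-bit operands; `divmod_u32` and `divmod_nonzero` are invented names, and the compiler is free to pick different instructions than the hand-written sequence.

```c
#include <stdint.h>

/* Hypothetical result pair, mirroring the quotient in eax and remainder in edx. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} divmod_u32;

/* Unsigned divide that treats a zero divisor as 1. */
divmod_u32 divmod_nonzero(uint32_t x, uint32_t y) {
    uint32_t mask = 0u - (uint32_t)(y == 0);  /* all ones if y == 0, else 0 */
    y -= mask;                                /* subtracting -1 adds 1 */
    divmod_u32 r = { x / y, x % y };
    return r;
}
```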

According to this link, you can just block the SIGFPE signal with sigaction() (I have not tried it myself, but I believe it should work).

This is the fastest possible approach if divide-by-zero errors are extremely rare: you only pay for the divisions by zero, not for the valid divisions, and the normal execution path is not changed at all.

However, the OS will be involved in every exception that's ignored, which is expensive. I think you should have at least a thousand good divisions per division by zero that you ignore. If exceptions are more frequent than that, you'll likely pay more by ignoring the exceptions than by checking every value before the division.
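A rough sketch of what signal-based recovery could look like is given below. It is a swapped-in technique rather than what the answer literally proposes: it catches SIGFPE and jumps past the faulting division with siglongjmp() instead of blocking or ignoring the signal, because POSIX leaves the behavior undefined if a hardware-generated SIGFPE is ignored or if its handler returns normally. It also relies on platform behavior, since integer division by zero is undefined in C: x86 Linux raises SIGFPE, while some other architectures do not trap at all. All names (`div_recover`, `on_sigfpe`, `install_sigfpe_handler`, `divide_all`) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf div_recover;  /* where to resume after a trapped division */

static void on_sigfpe(int signo) {
    (void)signo;
    siglongjmp(div_recover, 1);  /* never return into the faulting instruction */
}

/* Installs the handler; returns 0 on success, -1 on failure (as sigaction does). */
int install_sigfpe_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL);
}

/* out[i] = num[i] / den[i]; slots whose division trapped are set to 0.
   Returns the number of traps that were caught. The recovery point is set
   once, outside the loop, so valid divisions pay no per-element check. */
size_t divide_all(const int *num, const int *den, int *out, size_t n) {
    volatile size_t i = 0;      /* volatile so the value survives siglongjmp */
    volatile size_t traps = 0;
    if (sigsetjmp(div_recover, 1) != 0) {
        out[i] = 0;             /* arbitrary result for the trapped slot */
        i++;
        traps++;
    }
    for (; i < n; i++)
        out[i] = num[i] / den[i];
    return traps;
}
```

Each caught trap costs a kernel round trip plus the jump, which is why this only pays off when zero divisors are very rare.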
