Efficient multiplication/division of two 128-bit integers on x86 (no 64-bit)

Question

Compiler: MinGW/GCC
Issues: No GPL/LGPL code allowed (GMP, or any bignum library for that matter, is overkill for this problem, as I already have the class implemented).

I have constructed my own 128-bit fixed-size big integer class (intended for use in a game engine, but it may be generalized to any use case). I find the performance of the current multiply and divide operations to be quite abysmal (yes, I have timed them, see below), and I'd like to improve on (or change) the algorithms that do the low-level number crunching.

The multiply and divide operators are unbearably slow compared to just about everything else in the class.

These are the approximate measurements on my own computer:

Raw times as defined by QueryPerformanceFrequency:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~546u
Division:           ~4760u (with maximum bit count)

As you can see, just doing the multiplication is many, many times slower than add or subtract. Division is about 10 times slower than multiplication.

I'd like to improve the speed of these two operators because there may be a very large number of computations being done per frame (dot products, various collision detection methods, etc.).

The structure (methods omitted) looks somewhat like:

class uint128_t
{
    public:
        unsigned long int dw3, dw2, dw1, dw0;
  //...
};

Multiplication is currently done using the typical long-multiplication method (in assembly, so that I can catch the EDX output) while ignoring the words that go out of range (that is, I'm only doing 10 mull's instead of 16).
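For readers who want to see the shape of that truncated long multiplication without the assembly, here is a minimal portable sketch. It is not the class's actual code: the names `u128` and `mul_low128` and the least-significant-word-first layout are assumptions made for illustration, and it relies on the compiler providing `uint64_t` (which GCC does on 32-bit x86 targets) to hold each 32x32 product.

```cpp
#include <cstdint>

// Hypothetical layout for illustration: w[0] is the least significant word.
struct u128 {
    uint32_t w[4];
};

// Low 128 bits of a * b, schoolbook style.
// Only partial products a.w[j] * b.w[i] with i + j < 4 can land in the
// low four words, which gives 4 + 3 + 2 + 1 = 10 multiplies instead of 16.
inline u128 mul_low128(const u128& a, const u128& b)
{
    u128 r = {{0, 0, 0, 0}};
    for (int i = 0; i < 4; ++i) {
        uint64_t carry = 0;
        for (int j = 0; i + j < 4; ++j) {
            // 32x32 -> 64 product plus the word already there plus carry.
            // The maximum is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t cannot overflow.
            uint64_t t = uint64_t(a.w[j]) * b.w[i] + r.w[i + j] + carry;
            r.w[i + j] = uint32_t(t);  // low half stays in this column
            carry = t >> 32;           // high half moves to the next column
        }
        // Any carry out of word 3 lies beyond 128 bits and is dropped.
    }
    return r;
}
```

Each inner-loop step corresponds to one `mull` followed by an add-with-carry chain in the assembly version.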

Division uses the shift-subtract algorithm (speed depends on the bit counts of the operands). However, it is not done in assembly. I found that a little too difficult to pull off and decided to let the compiler optimize it.
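As a reference point, shift-subtract (restoring) division over four 32-bit words can be sketched as follows. This is an illustrative version, not the code from the class: `u128`, `shl1`, `ge`, `sub` and `divmod` are hypothetical names, the word order is assumed to be least significant first, and the loop always runs all 128 iterations rather than skipping leading zero bits.

```cpp
#include <cstdint>

// Hypothetical layout for illustration: w[0] is the least significant word.
struct u128 {
    uint32_t w[4];
};

// Shift x left by one bit, feeding `in` into bit 0; returns the bit shifted out.
inline uint32_t shl1(u128& x, uint32_t in)
{
    for (int i = 0; i < 4; ++i) {
        uint32_t out = x.w[i] >> 31;
        x.w[i] = (x.w[i] << 1) | in;
        in = out;
    }
    return in;
}

// Unsigned comparison a >= b, most significant word first.
inline bool ge(const u128& a, const u128& b)
{
    for (int i = 3; i >= 0; --i)
        if (a.w[i] != b.w[i]) return a.w[i] > b.w[i];
    return true;
}

// a -= b with borrow propagation (caller guarantees a >= b).
inline void sub(u128& a, const u128& b)
{
    uint32_t borrow = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t t = uint64_t(a.w[i]) - b.w[i] - borrow;
        a.w[i] = uint32_t(t);
        borrow = uint32_t(t >> 63);  // 1 if the 64-bit subtraction wrapped
    }
}

// Restoring division: n becomes the quotient, rem receives the remainder.
// d must be nonzero.
inline void divmod(u128& n, const u128& d, u128& rem)
{
    for (int i = 0; i < 4; ++i) rem.w[i] = 0;
    for (int bit = 0; bit < 128; ++bit) {
        uint32_t top = shl1(n, 0);  // pull the next dividend bit off the top
        shl1(rem, top);             // rem = rem * 2 + that bit (cannot overflow)
        if (ge(rem, d)) {
            sub(rem, d);
            n.w[0] |= 1;            // record a 1 in the quotient
        }
    }
}
```

One quotient bit is produced per iteration, which is why the cost grows with the operand bit count and why higher-radix methods, which produce several bits per step, look attractive.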

I have Googled around for several days looking at pages describing algorithms such as Karatsuba multiplication, high-radix division, and Newton-Raphson division, but the math symbols are a little too far over my head. I'd like to use some of these advanced methods to speed up my code, but I'd have to translate the "Greek" into something comprehensible first.

For those who may deem my efforts "premature optimization": I consider this code to be a bottleneck because the very elementary math operations themselves become slow. I can ignore such types of optimization in higher-level code, but this code will be called often enough for it to matter.

I'd like suggestions on which algorithm I should use to improve multiply and divide (if possible). A basic (hopefully easy to understand) explanation of how the suggested algorithm works would be highly appreciated.

EDIT: Multiply Improvements

I was able to improve the multiply operation by inlining the code into operator*=, and it now seems about as fast as it can get.

Updated raw times:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~100u (lowest ~86u, highest around ~256u)
Division:           ~4760u (with maximum bit count)

Here's some bare-bones code for you to examine (note that my type names are actually different; this was edited for simplicity):

//File: "int128_t.h"
class int128_t
{
    uint32_t dw3, dw2, dw1, dw0;

    // Various constructors, operators, etc...

    int128_t& operator*=(const int128_t&  rhs) __attribute__((always_inline))
    {
        int128_t Urhs(rhs);
        uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
        uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
        uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
        dw0 ^= lhs_xor_mask;
        dw1 ^= lhs_xor_mask;
        dw2 ^= lhs_xor_mask;
        dw3 ^= lhs_xor_mask;
        Urhs.dw0 ^= rhs_xor_mask;
        Urhs.dw1 ^= rhs_xor_mask;
        Urhs.dw2 ^= rhs_xor_mask;
        Urhs.dw3 ^= rhs_xor_mask;
        *this += (lhs_xor_mask & 1);
        Urhs += (rhs_xor_mask & 1);

        struct mul128_t
        {
            int128_t dqw1, dqw0;
            mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
        };

        mul128_t data(Urhs,*this);
        asm volatile(
        "push      %%ebp                            \n\
        movl       %%eax,   %%ebp                   \n\
        movl       $0x00,   %%ebx                   \n\
        movl       $0x00,   %%ecx                   \n\
        movl       $0x00,   %%esi                   \n\
        movl       $0x00,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ebx                   \n\
        adcl       %%edx,   %%ecx                   \n\
        adcl       $0x00,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   16(%%ebp),   %%eax #Calc: (dw3*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw2)  \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   24(%%ebp),  %%eax #Calc: (dw1*dw2)   \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw3)  \n\
        mull               (%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        pop        %%ebp                            \n"
        :"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
        :"a"(&data):"%ebp");

        dw0 ^= result_xor_mask;
        dw1 ^= result_xor_mask;
        dw2 ^= result_xor_mask;
        dw3 ^= result_xor_mask;
        return (*this += (result_xor_mask & 1));
    }
};

As for division, examining the code is rather pointless, as I will need to change the mathematical algorithm to see any substantial benefit. The only feasible choice seems to be high-radix division, but I have yet to work out (in my mind) just how it will work.

Answer 1

I wouldn't worry much about multiplication. What you're doing seems quite efficient. I didn't really follow the Greek on Karatsuba multiplication, but my feeling is that it would be more efficient only with much larger numbers than you're dealing with.

One suggestion I do have is to use the smallest possible blocks of inline assembly, rather than coding your logic in assembly. You could write a function:

struct div_result { u_int x[2]; };
static inline void mul_add(int a, int b, struct div_result *res);

The function would be implemented in inline assembly, and you'd call it from C++ code. It should be as efficient as pure assembly, and much easier to code.
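A minimal sketch of such a helper is shown below. It deliberately uses plain C++ with a 64-bit intermediate instead of inline assembly, on the assumption that the compiler turns a 32x32 to 64-bit multiply into a single `mull` on 32-bit x86 (GCC typically does, but check the generated code). The name `mul_add_carry`, its extra parameters, and the `mul_result` struct are hypothetical and differ from the declaration above.

```cpp
#include <cstdint>

// Hypothetical result type: x[0] is the low word, x[1] the high word.
struct mul_result {
    uint32_t x[2];
};

// Computes a * b + addend + carry_in as a 64-bit value split into two words.
// The sum always fits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1.
static inline void mul_add_carry(uint32_t a, uint32_t b,
                                 uint32_t addend, uint32_t carry_in,
                                 mul_result* res)
{
    uint64_t t = uint64_t(a) * b + addend + carry_in;
    res->x[0] = uint32_t(t);        // word to store in the current column
    res->x[1] = uint32_t(t >> 32);  // carry into the next column
}
```

The long-multiplication loops can then be written in C++, calling this helper once per partial product and passing `x[1]` from one call as `carry_in` to the next.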

About division, I don't know. Most algorithms I have seen are described in terms of asymptotic efficiency, which may mean they pay off only for very large bit counts.

Answer 2

Do I understand your data correctly: you are running your test on a 1.8 GHz machine, and the "u" in your timings are processor cycles?

If so, 546 cycles for 10 32x32-bit MULs seems a bit slow to me. I have my own brand of bignums here on a 2 GHz Core2 Duo, and a 128x128=256-bit MUL runs in about 150 cycles (I do all 16 small MULs), i.e. roughly 6 times faster per 32-bit MUL. But that could simply be down to a faster CPU.

Make sure you unroll the loops to save that overhead. Save as few registers as you can get away with. It might help if you post the ASM code here, so we can review it.

Karatsuba will not help you, since it only starts to be efficient from some 20-40 32-bit words upward.

Division is always much more expensive than multiplication. If you divide by a constant, or by the same value many times, it might help to pre-compute the reciprocal and then multiply by it.
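To illustrate the reciprocal idea at a scale that is easy to check, here is a sketch for 16-bit operands: the reciprocal is stored as a fixed-point value scaled by 2^32, and each later division becomes one multiply and one shift. The function names are hypothetical. The same principle carries over to 128-bit values, but the reciprocal must then carry enough extra bits for the result to stay exact, which this sketch does not cover.

```cpp
#include <cstdint>

// Precompute ceil(2^32 / d) for a nonzero 16-bit divisor d.
// floor((2^32 - 1) / d) + 1 equals ceil(2^32 / d) for every d >= 1.
inline uint64_t make_reciprocal(uint16_t d)
{
    return uint64_t(0xFFFFFFFFu) / d + 1;
}

// Returns n / d using only a multiply and a shift.
// Exact for all 16-bit n: the rounding error in the reciprocal is below d,
// so after multiplying by n < 2^16 it stays too small to change the floor.
inline uint16_t div_by_reciprocal(uint16_t n, uint64_t recip)
{
    return uint16_t((recip * n) >> 32);
}
```

The reciprocal is computed once per divisor, so the one real division is amortized over every subsequent use of the same divisor.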
