Using Intel inline assembler to code bigint add with carry

I would like to write fast code for adding 64-bit numbers in big ints:

uint64_t ans[n];
uint64_t a[n], b[n]; // assume initialized values....
for (int i = 0; i < n; i++)
  ans[i] = a[i] + b[i];

but the above does not work with carry.

I saw another question here that suggested using an if statement to check for the carry, which is elegant:

ans[0] = a[0] + b[0];
int c = ans[0] < a[0];          // carry out of the lowest limb
for (int i = 1; i < n; i++) {   // start at 1: limb 0 was handled above
  ans[i] = a[i] + b[i] + c;
  c = ans[i] < a[i];
}

However, I would like to learn how to embed inline (intel) assembly and do it faster. I am sure there are 64 bit opcodes, the equivalent of:

add eax, ebx
adc ...

but I don't know how to pass parameters to the assembler from the rest of the c++ code.

but the above does not work with carry.

If you mean that GCC does not generate code that uses the ADC instruction, that's because its optimizer has determined that there is a more optimal way to implement the addition.

Here is my test version of your code. I have extracted the arrays out as parameters passed to the function so that the code cannot be elided and we can limit our study to the relevant portions.

void Test(uint64_t* a, uint64_t* b, uint64_t* ans, int n)
{
    for (int i = 0; i < n; ++i)
    {
        ans[i] = a[i] + b[i];
    }
}

Now, indeed, when you compile this with a modern version of GCC and look at the disassembly, you'll see a bunch of crazy-looking code.

The Godbolt compiler explorer is helpful enough that it color-codes lines of C source and their corresponding assembly code (or at least, it does so to the best of its ability; this isn't perfect in optimized code, but it works well enough here). The purple code is what implements the 64-bit addition in the inner body of the loop. GCC is emitting SSE2 instructions to do the addition. Specifically, you can pick out MOVDQU (which does an unaligned move of a double quadword from memory into an XMM register), PADDQ (which does an addition on packed integer quadwords), and MOVQ (which moves a quadword from an XMM register into memory). Roughly speaking, for a non-assembly expert, MOVDQU is how it loads the 64-bit integer values, PADDQ does the addition, and then MOVQ stores the result.
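To make the load / PADDQ / store pattern concrete, here is a rough hand-written equivalent using SSE2 intrinsics. It is only a sketch: the function name add_lanes_sse2 is made up for illustration, and it assumes an x86 target with SSE2 and a compiler that provides <emmintrin.h>. Note that PADDQ adds each 64-bit lane independently, so no carry ever passes from one element to the next.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: element-wise ans[i] = a[i] + b[i], two 64-bit lanes
// per step. There is no carry between elements.
inline void add_lanes_sse2(const uint64_t* a, const uint64_t* b,
                           uint64_t* ans, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        // Unaligned 128-bit loads (MOVDQU)
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Packed 64-bit add (PADDQ), then an unaligned store
        _mm_storeu_si128(reinterpret_cast<__m128i*>(ans + i),
                         _mm_add_epi64(va, vb));
    }
    // Scalar tail for an odd element count
    for (; i < n; ++i) {
        ans[i] = a[i] + b[i];
    }
}
```

Because each lane wraps around on its own, this is a valid translation of the first loop only; it cannot implement the carry-propagating version.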

Part of what makes this output especially noisy and confusing is that GCC is vectorizing and unrolling the for loop. If you disable auto-vectorization (-fno-tree-vectorize), you get output that is easier to read, although it's still doing the same thing using the same instructions. (Well, mostly. Now it's using MOVQ everywhere, for both loads and stores, instead of loading with MOVDQU.)

On the other hand, if you specifically forbid the compiler from using SSE2 instructions (-mno-sse2), you see output that is significantly different. Now, because it can't use SSE2 instructions, it's emitting basic x86 instructions to do the 64-bit addition, and the only way to do it is ADD + ADC.

I suspect that this is the code you were expecting to see. Clearly, GCC believes that vectorizing the operation results in faster code, so this is what it does when you compile with the -O2 or -O3 flags. At -O1, it always uses ADD + ADC. This is one of those cases where fewer instructions do not imply faster code. (Or at least, GCC doesn't think so. Benchmarks on actual code might tell a different story. The overhead may be significant in certain contrived scenarios but irrelevant in the real world.)

For what it's worth, Clang behaves very similarly to GCC here.


If you meant that this code doesn't carry the result of the previous addition over to the next addition, then you're right. The second snippet of code that you've shown implements that algorithm, and GCC does compile this using the ADC instruction.

At least, it does when targeting x86-32. When targeting x86-64, where you have native 64-bit integer registers available, no "carrying" is even necessary; simple ADD instructions are sufficient, requiring significantly less code. In fact, this is only "bigint" arithmetic on 32-bit architectures, which is why I have assumed x86-32 in all of the foregoing analysis and compiler output.

In a comment, Ped7g wonders why compilers don't seem to have the idea of the ADD + ADC chain idiom. I'm not entirely sure what he's referring to here, since he didn't share any examples of the input code that he tried, but as I've shown, compilers do use ADC instructions here. However, compilers don't chain carries across loop iterations. This is too difficult to implement in practice, because so many instructions overwrite the flags. Someone hand-writing the assembly code might be able to do it, but compilers won't.

(Note that c should probably be an unsigned integer to encourage certain optimizations. In this case, it just ensures that GCC uses an XOR instruction when preparing to do a 64-bit addition instead of a CDQ. That is slightly faster but not a huge improvement, though mileage may vary with real code.)

(Also, it's disappointing that GCC is unable to emit branchless code for setting c inside of the loop. With sufficiently random input values, branch prediction will fail, and you'll end up with relatively inefficient code. There are almost certainly ways of writing the C source to persuade GCC to emit branchless code, but that's an entirely different answer.)
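As one possible way of writing the source so that the carry is an explicit unsigned value, here is a sketch built on the GCC/Clang builtin __builtin_add_overflow. The function name add_limbs_portable is hypothetical, and whether the compiler turns this into branchless code or an ADC chain depends on the compiler and version, so check the disassembly. Splitting the addition into two checked steps also covers the case where b[i] + carry itself wraps around, which a single ans[i] < a[i] comparison misses.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: ans = a + b over n 64-bit limbs (least significant
// limb first). Returns the carry out of the top limb (0 or 1).
// Assumes GCC or Clang, which provide __builtin_add_overflow.
inline unsigned add_limbs_portable(const uint64_t* a, const uint64_t* b,
                                   uint64_t* ans, std::size_t n) {
    unsigned carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t partial;
        // Step 1: a[i] + b[i], recording whether it wrapped
        unsigned c1 = __builtin_add_overflow(a[i], b[i], &partial);
        // Step 2: add the incoming carry, recording whether that wrapped
        unsigned c2 = __builtin_add_overflow(
            partial, static_cast<uint64_t>(carry), &ans[i]);
        // At most one of the two steps can wrap for a given limb
        carry = c1 | c2;
    }
    return carry;
}
```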


I would like to learn how to embed inline (intel) assembly and do it faster.

Well, we've already seen that it might not necessarily be faster if you naïvely caused a bunch of ADC instructions to be emitted. Don't hand-optimize unless you are confident that your assumptions about performance are correct!

Also, not only is inline assembly difficult to write, debug, and maintain, but it may even make your code slower because it inhibits certain optimizations that could otherwise have been done by the compiler. You need to be able to prove that your hand-coded assembly is a significant enough performance win over what the compiler would generate that these considerations become less relevant. You also should confirm that there is no way to get the compiler to generate code that is close to your ideal output, either by altering flags or cleverly writing the C source.

But if you really wanted to, you could read any of a variety of online tutorials that teach you how to use GCC's inline assembler. This is a pretty good one; there are plenty of others. And of course, there is the manual. All will explain how "extended asm" allows you to specify input operands and output operands, which will answer your question of "how to pass parameters to the assembler from the rest of the c++ code".
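As a small illustration of operand passing with extended asm, the sketch below adds one 128-bit value into another using a single ADD + ADC pair. It assumes GCC or Clang targeting x86-64 with the default AT&T syntax; the function name add128_asm and the operand names are made up for this example. The variables are bound to registers through the constraint lists, and the "cc" clobber tells the compiler that the flags are modified.

```cpp
#include <cstdint>

// Hypothetical helper: (a_hi:a_lo) += (b_hi:b_lo) using ADD then ADC.
// Assumes x86-64 and GCC/Clang extended asm (AT&T operand order: src, dst).
inline void add128_asm(uint64_t& a_lo, uint64_t& a_hi,
                       uint64_t b_lo, uint64_t b_hi) {
    asm("add %[bl], %[al]\n\t"   // a_lo += b_lo, sets CF on wrap-around
        "adc %[bh], %[ah]"       // a_hi += b_hi + CF
        // "+" marks read-write operands; "&" (early clobber) keeps a_lo
        // out of any register that still holds an input read later.
        : [al] "+&r"(a_lo), [ah] "+r"(a_hi)
        : [bl] "r"(b_lo), [bh] "r"(b_hi)
        : "cc");
}
```

Both instructions live in one asm statement because the carry flag cannot be handed from one asm statement to the next.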

As paddy and Christopher Oicles suggested, you should prefer intrinsics to inline assembly. The _addcarry_u32 and _addcarry_u64 intrinsics take a carry-in, add two operands, and return the carry-out, and compilers normally implement them with plain ADD/ADC, so they need no instruction-set extension. How well the carry stays in CF across a loop of such calls depends on the compiler and its version, so check the generated code; if it is poor, your remaining options are inline assembly or, as I already suggested, writing the C source so that the compiler will do the Right Thing™ on its own.

There are also _addcarryx_u32 and _addcarryx_u64 intrinsics. These correspond to the ADCX and ADOX instructions, "extended" versions of ADC that can produce more efficient code when two independent carry chains are interleaved. They are part of the Intel ADX instruction set, introduced with the Broadwell microarchitecture. In my opinion, Broadwell does not have sufficiently high market penetration that you can simply emit ADCX or ADOX instructions and call it a day. Lots of users still have older machines, and it's in your interest to support them to the extent possible. These instructions are great if you're preparing builds tuned for specific architectures, but I would not recommend them for general use.
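For reference, a carry loop written with _addcarry_u64 might look like the sketch below. It assumes an x86-64 target and a compiler that provides the intrinsic through <immintrin.h>; the function name add_limbs_addcarry is hypothetical. Which instruction the compiler actually emits for each call is its own choice, so it is worth checking the disassembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // _addcarry_u64 (x86-64 only)

// Hypothetical helper: ans = a + b over n 64-bit limbs (least significant
// limb first). Returns the carry out of the top limb.
inline unsigned char add_limbs_addcarry(const uint64_t* a, const uint64_t* b,
                                        uint64_t* ans, std::size_t n) {
    unsigned char carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // The intrinsic's out-parameter is unsigned long long*, which is not
        // always the same type as uint64_t*, so go through a temporary.
        unsigned long long sum;
        carry = _addcarry_u64(carry, a[i], b[i], &sum);
        ans[i] = sum;
    }
    return carry;
}
```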


I am sure there are 64 bit opcodes, the equivalent of: add + adc

There are 64-bit versions of ADD and ADC (and ADCX and ADOX) when you're targeting a 64-bit architecture. This would then allow you to implement 128-bit "bigint" arithmetic using the same pattern.
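On a 64-bit target you can also let the compiler produce that pattern for you. The sketch below relies on the unsigned __int128 extension offered by GCC and Clang on 64-bit targets; the U128 struct and the add128_int128 name are invented for illustration. On x86-64 a 128-bit addition like this is typically compiled to an ADD followed by an ADC, though that is worth confirming in the disassembly.

```cpp
#include <cstdint>

// Hypothetical two-limb representation of a 128-bit unsigned value
struct U128 {
    uint64_t lo;
    uint64_t hi;
};

// Adds two 128-bit values (modulo 2^128) via the compiler's 128-bit type.
// Assumes GCC or Clang on a 64-bit target (unsigned __int128 extension).
inline U128 add128_int128(U128 x, U128 y) {
    unsigned __int128 wx = (static_cast<unsigned __int128>(x.hi) << 64) | x.lo;
    unsigned __int128 wy = (static_cast<unsigned __int128>(y.hi) << 64) | y.lo;
    unsigned __int128 sum = wx + wy;
    return U128{static_cast<uint64_t>(sum),
                static_cast<uint64_t>(sum >> 64)};
}
```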

On x86-32, there are no 64-bit versions of these instructions in the base instruction set. You must turn to SSE2, like we saw GCC and Clang do.

I'm not entirely sure if this is what you were looking for, and my assembly skills are definitely not the best (lack of suffixes for example), but this uses ADC and should solve your problem.

Note the omission of the C++ for loop; we need to loop in asm because we need CF to survive across iterations. (GCC6 has flag output constraints, but not flag inputs; there's no way to ask the compiler to pass FLAGS from one asm statement to another, and gcc would probably do it inefficiently with setc/cmp even if there was syntax for it.)

#include <cstdint>
#include <iostream>

#define N 4

int main(int argc, char *argv[]) {

  uint64_t ans[N];
  const uint64_t a[N] = {UINT64_MAX, UINT64_MAX, 0, 0};
  const uint64_t b[N] = {2, 1, 3, 1};

  const uint64_t i = N;
  asm volatile (
      "xor %%eax, %%eax\n\t"      // byte offset = 0, and clear CF
      "mov %3, %%rdi\n\t"         // rdi = N (remaining iterations)

      ".L_loop:\n\t"
      "mov (%%rax,%1), %%rdx\n\t" // rdx = a[i]
      "adc (%%rax,%2), %%rdx\n\t" // rdx += b[i] + carry
      "mov %%rdx, (%%rax,%0)\n\t" // ans[i] = rdx
      "lea 8(%%rax), %%rax\n\t"   // advance offset by 8 bytes; LEA leaves CF intact
      "dec %%rdi\n\t"             // --count; DEC does not modify CF
      "jnz .L_loop\n\t"           // if (rdi != 0) goto .L_loop;
      : /* Outputs (none) */
      : /* Inputs */ "r" (ans), "r" (a), "r" (b), "r" (i)
      : /* Clobbered */ "rax", "rdx", "rdi", "cc", "memory"
  );
  );

  // SHOULD OUTPUT 1 1 4 1
  for (int i = 0; i < N; ++i)
    std::cout << ans[i] << std::endl;

  return 0;
}

In order to avoid setting the carry flag (CF), I needed to count down to 0 in order to avoid doing a CMP. DEC does not set the carry flag, so it may be the perfect contender for this application. However, I don't know how to index from the beginning of arrays any faster using %rdi than the extra instruction and register needed for inc %rax.

The volatile and "memory" clobber are necessary because we only ask the compiler for pointer inputs, and don't tell it which memory we actually read and write.

On some older CPUs, notably Core2 / Nehalem, adc after inc will cause a partial-flag stall. See "Problems with ADC/SBB and INC/DEC in tight loops on some CPUs". But on modern CPUs, this is efficient.

EDIT: As pointed out by @PeterCordes, my inc %rax and scaling by 8 with lea was horribly inefficient (and stupid now that I think about it). Now, it is simply lea 8(%rax), %rax.


Editor's note: we can save another instruction by using a negative index from the end of the array, counting up toward 0 with inc / jnz.

(This hard-codes the array size at 4. You could maybe make this more flexible by asking for the array length as an immediate constant, and -i as an input. Or asking for pointers to the end.)

// untested
  asm volatile (
      "mov   $-3, %[idx]\n\t"        // i=-3   (which we will scale by 8)

      "mov   (%[a]), %%rdx  \n\t"
      "add   (%[b]), %%rdx  \n\t"    // peel the first iteration so we don't have to zero CF first, and ADD is faster on some CPUs.
      "mov    %%rdx, (%[ans]) \n\t"  // ans[0] = rdx (named operand; %0 would be idx here)

      ".L_loop:\n\t"                        // do{
      "mov    8*4(%[a], %[idx], 8), %%rdx\n\t"   // rdx = a[i + len]
      "adc    8*4(%[b], %[idx], 8), %%rdx\n\t"   // rdx += b[i + len] + carry
      "mov    %%rdx,  8*4(%[ans], %[idx], 8)\n\t"  // ans[i + len] = rdx

      "inc    %[idx]\n\t"
      "jnz    .L_loop\n\t"                  // }while (++i);

      : /* Outputs, actually a read-write input */ [idx] "+&r" (i)
      : /* Inputs */ [ans] "r" (ans), [a] "r" (a), [b] "r" (b)
      : /* Clobbered */ "rdx", "memory"
  );

The loop label should probably use %%= in case GCC duplicates this code, or use a numbered local label like 1:
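Putting the editor's suggestions together (pointers to the end of each array, a negative index counting up toward zero, and a numbered local label), a length-agnostic version could look roughly like the sketch below. It is an illustration rather than a tuned routine: the function name add_limbs_asm is hypothetical, and it assumes GCC or Clang on x86-64 with AT&T syntax. It clears CF with CLC instead of peeling the first iteration, and it reads the final carry with SETC.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: ans = a + b over n 64-bit limbs (least significant
// limb first), with the whole loop in one asm statement so CF survives
// between iterations. Returns the carry out of the top limb.
inline unsigned char add_limbs_asm(uint64_t* ans, const uint64_t* a,
                                   const uint64_t* b, std::size_t n) {
    if (n == 0) {
        return 0;  // the asm loop below requires at least one iteration
    }
    std::intptr_t idx = -static_cast<std::intptr_t>(n);  // counts up to 0
    uint64_t tmp;
    unsigned char carry;
    asm volatile(
        "clc\n\t"                                 // start with CF = 0
        "1:\n\t"                                  // numbered local label
        "mov (%[a], %[idx], 8), %[tmp]\n\t"       // tmp = a[n + idx]
        "adc (%[b], %[idx], 8), %[tmp]\n\t"       // tmp += b[n + idx] + CF
        "mov %[tmp], (%[ans], %[idx], 8)\n\t"     // ans[n + idx] = tmp
        "inc %[idx]\n\t"                          // INC leaves CF untouched
        "jnz 1b\n\t"                              // loop until idx reaches 0
        "setc %[carry]"                           // carry out of the last ADC
        : [idx] "+&r"(idx), [tmp] "=&r"(tmp), [carry] "=r"(carry)
        : [a] "r"(a + n), [b] "r"(b + n), [ans] "r"(ans + n)
        : "cc", "memory");
    return carry;
}
```

The "memory" clobber stands in for precise memory operands, as in the versions above.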

Using a scaled-index addressing mode is no more expensive than a regular indexed addressing mode (2 registers) like we were using before. Ideally we'd use a one-register addressing mode for either the adc or the store, maybe indexing the other two arrays relative to ans, by subtracting the pointers on input.

But then we'd need a separate LEA to increment by 8, because we still need to avoid destroying CF. Still, on Haswell and later, indexed stores can't use the AGU on port 7, and on Sandybridge/Ivybridge they un-laminate to 2 uops. So for Intel SnB-family, avoiding an indexed store here would be good because we need 2x load + 1x store per iteration. See "Micro fusion and addressing modes".

Earlier Intel CPUs (Core2 / Nehalem) will have partial-flag stalls on the above loop, so the above issues are irrelevant for them.

AMD CPUs are probably fine with the above loop. Agner Fog's optimization and microarch guides don't mention any serious problems.

Unrolling a bit wouldn't hurt, though, for AMD or Intel.
