How to speed up dynamic dispatch by 20% using computed gotos in standard C++

Before you down-vote or start saying that goto is evil and obsolete, please read the justification of why it is viable in this case. Before you mark it as a duplicate, please read the full question.

I was reading about virtual machine interpreters when I stumbled across computed gotos. Apparently they allow a significant performance improvement in certain pieces of code. The best-known example is the main VM interpreter loop.

Consider a (very) simple VM like this:

#include <iostream>

enum class Opcode
{
    HALT,
    INC,
    DEC,
    BIT_LEFT,
    BIT_RIGHT,
    RET
};

int main()
{
    Opcode program[] = { // an example program that returns 10
        Opcode::INC,
        Opcode::BIT_LEFT,
        Opcode::BIT_LEFT,
        Opcode::BIT_LEFT,
        Opcode::INC,
        Opcode::INC,
        Opcode::RET
    };
    
    int result = 0;

    for (Opcode instruction : program)
    {
        switch (instruction)
        {
        case Opcode::HALT:
            break;
        case Opcode::INC:
            ++result;
            break;
        case Opcode::DEC:
            --result;
            break;
        case Opcode::BIT_LEFT:
            result <<= 1;
            break;
        case Opcode::BIT_RIGHT:
            result >>= 1;
            break;
        case Opcode::RET:
            std::cout << result;
            return 0;
        }
    }
}

All this VM can do is perform a few simple operations on a single number of type int and print it. In spite of its doubtful usefulness, it illustrates the subject nonetheless.

The critical part of the VM is obviously the switch statement in the for loop. Its performance is determined by many factors, of which the most important are almost certainly branch prediction and the act of jumping to the appropriate point of execution (the case labels).

There is room for optimization here. To speed up the execution of this loop, one might use so-called computed gotos.

Computed Gotos

Computed gotos are a construct well known to Fortran programmers and to those using a certain (non-standard) GCC extension. I do not endorse the use of any non-standard, implementation-defined, or (obviously) undefined behavior. However, to illustrate the concept in question, I will use the syntax of that GCC extension.

In standard C++ we are allowed to define labels that can later be jumped to by a goto statement:

goto some_label;

some_label:
    do_something();

Doing this isn't considered good code (and for a good reason!). Although there are good arguments against using goto (most of which relate to code maintainability), there is an application for this abominated feature: improving performance.

Using a goto statement can be faster than a function invocation. This is because of the amount of "paperwork", like setting up the stack and returning a value, that has to be done when invoking a function. Meanwhile, a goto can sometimes be converted into a single jmp assembly instruction.

To exploit the full potential of goto, an extension to the GCC compiler was made that allows goto to be more dynamic. That is, the label to jump to can be determined at run-time.

This extension allows one to obtain a label pointer, similar to a function pointer, and to goto it:

    void* label_ptr = &&some_label;
    goto *label_ptr;

some_label:
    do_something();

This is an interesting concept that allows us to further enhance our simple VM. Instead of using a switch statement, we will use an array of label pointers (a so-called jump table) and then goto the appropriate one (the opcode will be used to index the array):

// Courtesy of Eli Bendersky
// This code is licensed under the Unlicense

int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes
    */
    static void* dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();
    while (1) {
        do_halt:
            return val;
        do_inc:
            val++;
            DISPATCH();
        do_dec:
            val--;
            DISPATCH();
        do_mul2:
            val *= 2;
            DISPATCH();
        do_div2:
            val /= 2;
            DISPATCH();
        do_add7:
            val += 7;
            DISPATCH();
        do_neg:
            val = -val;
            DISPATCH();
    }
}

This version is about 25% faster than the one that uses a switch (the one in the linked blog post, not the one above). This is because only one jump is performed after each operation, instead of two.
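For comparison, a switch-based interpreter for the same seven opcodes might look like the sketch below. It is not the code from the linked blog post: the name interp_switch, the OP_* constants and the handling of unknown opcodes are assumptions made for illustration, with the opcode numbering taken from the order of the entries in dispatch_table.

```cpp
#include <cassert>

// Hypothetical opcode constants; the numbering is assumed to follow
// the order of the labels in dispatch_table.
enum : unsigned char {
    OP_HALT = 0, OP_INC, OP_DEC, OP_MUL2, OP_DIV2, OP_ADD7, OP_NEG
};

// Switch-based counterpart of interp_cgoto (illustrative name).
int interp_switch(const unsigned char* code, int initval) {
    assert(code != nullptr);
    int pc = 0;
    int val = initval;
    while (true) {
        // Every iteration comes back here before dispatching again:
        // this is the extra jump that the computed-goto version avoids.
        switch (code[pc++]) {
        case OP_HALT: return val;
        case OP_INC:  val++;      break;
        case OP_DEC:  val--;      break;
        case OP_MUL2: val *= 2;   break;
        case OP_DIV2: val /= 2;   break;
        case OP_ADD7: val += 7;   break;
        case OP_NEG:  val = -val; break;
        default:      return val; // unknown opcode: stop (a choice made for this sketch)
        }
    }
}
```

Running it on the program { OP_INC, OP_MUL2, OP_ADD7, OP_NEG, OP_HALT } with an initial value of 0 computes (0 + 1) * 2 + 7 = 9 and then negates it, giving -9.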

Control flow with switch:

[Image: two jumps with switch]

For example, if we wanted to execute Opcode::FOO and then Opcode::SOMETHING, it would look like this:

[Image]

As you can see, two jumps are performed after an instruction is executed. The first one goes back to the switch code and the second goes to the actual instruction.

By contrast, if we go with an array of label pointers (as a reminder, they are non-standard), we have only one jump:

[Image]

It is worth noting that, in addition to saving cycles by doing fewer operations, we also improve the quality of branch prediction by eliminating the additional jump.

Now, we know that by using an array of label pointers instead of a switch we can improve the performance of our VM significantly (by about 20%). I figured that maybe this could have some other applications too.

I came to the conclusion that this technique could be used in any program that has a loop in which it repeatedly dispatches some logic indirectly. A simple example of this (apart from the VM) could be invoking a virtual method on every element of a container of polymorphic objects:

std::vector<Base*> objects;
objects = get_objects();
for (auto object : objects)
{
    object->foo();
}

Now, this has many more applications.

There is one problem though: there is no such thing as label pointers in standard C++. As such, the question is: is there a way to simulate the behaviour of computed gotos in standard C++ that can match them in performance?

Edit 1:

There is yet another downside to using the switch. I was reminded of it by user1937198. It is bounds checking. In short, it checks whether the value of the variable inside the switch matches any of the cases. It adds redundant branching (this check is mandated by the standard).

Edit 2:

In response to cmaster, I will clarify my idea for reducing the overhead of virtual function calls. A dirty approach would be to have an id in each derived instance representing its type, which would be used to index the jump table (label pointer array). The problem is that:

  1. There are no jump tables in standard C++.
  2. It would require us to modify all jump tables when a new derived class is added.
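The "dirty approach" can at least be sketched in standard C++ by substituting function pointers for label pointers. This is only an illustration of the indexing idea, not an equivalent of computed gotos: each dispatch is still an indirect call rather than a jump, so no performance claim is made for it. All names here (TypeId, foo_table, call_foo_on_all, the DerivedA/DerivedB classes and their handlers) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type ids; COUNT must stay last.
enum class TypeId : std::size_t { A, B, COUNT };

struct Base {
    TypeId id;      // set once by each derived constructor
    int value = 0;
    explicit Base(TypeId t) : id(t) {}
};

struct DerivedA : Base { DerivedA() : Base(TypeId::A) {} };
struct DerivedB : Base { DerivedB() : Base(TypeId::B) {} };

// One handler per derived type, standing in for the label targets.
inline void foo_a(Base& b) { b.value += 1; }
inline void foo_b(Base& b) { b.value += 2; }

using Handler = void (*)(Base&);

// The "jump table": indexed by TypeId, entries in enum order.
// Adding a derived class means adding an entry here (problem 2 above).
constexpr std::array<Handler, static_cast<std::size_t>(TypeId::COUNT)> foo_table = {
    foo_a, foo_b
};

inline void call_foo_on_all(const std::vector<Base*>& objects) {
    for (Base* object : objects) {
        const auto index = static_cast<std::size_t>(object->id);
        assert(index < foo_table.size());
        foo_table[index](*object);   // indirect call through the table
    }
}
```

Calling call_foo_on_all on a vector holding a DerivedA twice and a DerivedB once increments the first object's value by 1 each time and the second object's value by 2.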

I would be thankful if someone came up with some type of template magic (or a macro as a last resort) that would allow this to be written in a cleaner, more extensible and automated way, like this:

On recent versions of MSVC, the key is to give the optimizer the hints it needs so that it can tell that just indexing into the jump table is a safe transform. There are two constraints in the original code that prevent this, and thus make optimising it into the code generated for the computed label version an invalid transform.

Firstly, in the original code, if the program counter overflows the program, the loop exits. In the computed label code, undefined behavior (dereferencing an out-of-range index) is invoked instead. Thus the compiler has to insert a check for this, causing it to generate a basic block for the loop header rather than inlining it into each switch block.

Secondly, in the original code, the default case is not handled. Whilst the switch covers all enum values, and thus it is undefined behavior for no branches to match, the MSVC optimiser is not intelligent enough to exploit this, so it generates a default case that does nothing. Checking for this default case requires a conditional, as it handles a large range of values. The computed goto code invokes undefined behavior in this case as well.

The solution to the first issue is simple. Don't use a C++ range-based for loop; use a while loop or a for loop with no condition. The solution to the second unfortunately requires platform-specific code, in the form of __assume(0), to tell the optimizer that the default case is undefined behavior. Something analogous is present in most compilers (__builtin_unreachable() in clang and gcc), and it can be conditionally compiled to nothing when no equivalent is present, without any correctness issues.

So the result of this is:

#include <iostream>

enum class Opcode
{
    HALT,
    INC,
    DEC,
    BIT_LEFT,
    BIT_RIGHT,
    RET
};

int run(Opcode* program) {
    int result = 0;
    for (int i = 0; true;i++)
    {
        auto instruction = program[i];
        switch (instruction)
        {
        case Opcode::HALT:
            break;
        case Opcode::INC:
            ++result;
            break;
        case Opcode::DEC:
            --result;
            break;
        case Opcode::BIT_LEFT:
            result <<= 1;
            break;
        case Opcode::BIT_RIGHT:
            result >>= 1;
            break;
        case Opcode::RET:
            std::cout << result;
            return 0;
        default:
            __assume(0);
        }
    }
}

The generated assembly can be verified on godbolt.
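The conditional compilation mentioned above could be sketched as follows. UNREACHABLE and run_portable are names invented for this example, and unlike run above, this variant returns the result instead of printing it so that it is easier to check. Whether a given compiler then emits a single indexed jump would still have to be confirmed by inspecting the generated assembly.

```cpp
// UNREACHABLE is a hypothetical macro name chosen for this sketch.
#if defined(__GNUC__) || defined(__clang__)
    #define UNREACHABLE() __builtin_unreachable()
#elif defined(_MSC_VER)
    #define UNREACHABLE() __assume(0)
#else
    #define UNREACHABLE() ((void)0)   // no hint available; still correct
#endif

enum class Opcode { HALT, INC, DEC, BIT_LEFT, BIT_RIGHT, RET };

// Same loop shape as run: no loop condition, and a default case that
// tells the optimizer no other opcode value can occur.
int run_portable(const Opcode* program) {
    int result = 0;
    for (int i = 0; ; ++i) {
        switch (program[i]) {
        case Opcode::HALT:      break;          // no-op, as in the original
        case Opcode::INC:       ++result;       break;
        case Opcode::DEC:       --result;       break;
        case Opcode::BIT_LEFT:  result <<= 1;   break;
        case Opcode::BIT_RIGHT: result >>= 1;   break;
        case Opcode::RET:       return result;
        default:                UNREACHABLE();
        }
    }
}
```

As in the original, the program must end with Opcode::RET, since nothing else stops the loop. For the example program from the question (INC, three BIT_LEFTs, two INCs, RET), run_portable returns 10.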
