
Using bools in calculations to avoid branches

Here's a little micro-optimization curiosity that I came up with:

struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};

It seems the two methods practically do the same thing. Does the second version avoid a branch (and, consequently, is it faster than the first version), or is any compiler able to do this kind of optimization with -O3?

Yes, your trick avoids the branch, and it makes the code faster... sometimes.

I wrote a benchmark that compares these solutions in various situations, along with my own:

ticks += mStepSize & -static_cast<int>(running)

My results are following: 我的结果如下:

Off:
 branch: 399949150
 mul:    399940271
 andneg: 277546678
On:
 branch: 204035423
 mul:    399937142
 andneg: 277581853
Pattern:
 branch: 327724860
 mul:    400010363
 andneg: 277551446
Random:
 branch: 915235440
 mul:    399916440
 andneg: 277537411

Off is when the timers are turned off. In this case the solutions take about the same time.

On is when they are turned on. The branching solution is twice as fast.

Pattern is when they follow a 100110 pattern. Performance is similar, but branching is a bit faster.

Random is when the branch is unpredictable. In this case multiplication is more than 2 times faster.

In all cases my bit-hacking trick is fastest, except for On, where branching wins.

Note that this benchmark is not necessarily representative of all compiler versions, processors, etc. Even small changes to the benchmark can turn the results upside down (for example, if the compiler can inline knowing that mStepSize is 1, then multiplication can actually be fastest).

Code of the benchmark:

#include <array>
#include <chrono>
#include <cstdlib>   // rand()
#include <iostream>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }

    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul:    " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;

    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand()&1;
    std::cout << "Random:\n";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}

Does the second version avoid a branch

If you compile your code to get assembler output, g++ -o test.s test.cpp -S, you'll find that a branch is indeed avoided in the second function.

and consequently, is faster than the first version

I ran each of your functions 2147483647 (INT_MAX) times, where in each iteration I randomly assigned a boolean value to the running member of your Timer struct, using this code:

#include <chrono>
#include <cstdlib>   // rand(), srand()
#include <ctime>     // time()
#include <iostream>
#include <limits>

// timestamp_t and get_timestamp() were not shown in the original post;
// here is a plausible stand-in based on <chrono> (microsecond resolution):
using timestamp_t = long long;

static timestamp_t get_timestamp() {
    return std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}

// assumes the Timer struct from the question above
int main() {
    const int max = std::numeric_limits<int>::max();
    timestamp_t start, end, one, two;
    Timer t_one, t_two;
    double percent;

    srand(time(NULL));

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_one.running = rand() % 2;
        t_one.step_versionOne(1);
    }
    end = get_timestamp();
    one = end - start;

    std::cout << "step_versionOne      = " << one << std::endl;

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_two.running = rand() % 2;
        t_two.step_versionTwo(1);
    }
    end = get_timestamp();
    two = end - start;

    percent = (one - two) / static_cast<double>(one) * 100.0;

    std::cout << "step_versionTwo      = " << two << std::endl;
    std::cout << "step_one - step_two  = " << one - two << std::endl;
    std::cout << "one fast than two by = " << percent << std::endl;
}

And these are the results I got:

step_versionOne      = 39738380
step_versionTwo      = 26047337
step_one - step_two  = 13691043
one fast than two by = 34.4529%

So yes, the second function is clearly faster, by around 35%. Note that the percentage increase in timed performance varied between 30 and 55 percent for smaller numbers of iterations, whereas it seems to plateau at around 35% the longer the benchmark runs. This might be due to sporadic execution of system tasks while the simulation is running, which becomes less sporadic, i.e. more consistent, the longer you run the sim (although this is just my assumption; I have no idea whether it's actually true).

All in all, nice question. I learned something today!


MORE:


Of course, by randomly generating running, we are essentially rendering branch prediction useless in the first function, so the results above are not too surprising. However, if we decide not to alter running during the loop iterations and instead leave it at its default value, in this case false, branch prediction will do its magic in the first function, which will actually be faster by almost 20%, as these results suggest:

step_versionOne      = 6273942
step_versionTwo      = 7809508
step_two - step_one  = 1535566
two fast than one by = 19.6628

Because running is constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running; likely the result of a compiler optimization.

Why is the second function slower in this case? Well, branch prediction will quickly realize that the condition in the first function is never met, and so will stop checking in the first place (as though if(running) ticks += mStepSize; isn't even there). On the other hand, the second function still has to perform the instruction ticks += mStepSize * static_cast<int>(running); in every iteration, thus making the first function more efficient.

But what if we set running to true? Well, branch prediction will kick in again; however, this time the first function will have to evaluate ticks += mStepSize; in every iteration. Here are the results with running{true}:

step_versionOne      = 7522095
step_versionTwo      = 7891948
step_two - step_one  = 369853
two fast than one by = 4.68646

Notice that step_versionTwo takes a consistent amount of time whether running is constantly true or false. But it still takes longer than step_versionOne, however marginally. Well, this might be because I was too lazy to run it many times to determine whether it's consistently slower or whether it was a one-time fluke (results vary slightly every time you run it, since the OS has to run in the background and it's not always going to do the same thing). But if it is consistently slower, it might be because function two (ticks += mStepSize * static_cast<int>(running);) performs one more arithmetic op than function one (ticks += mStepSize;).

Finally, let's compile with an optimization, g++ -o test test.cpp -std=c++11 -O1, revert running back to false, and then check the results:

step_versionOne      = 704973
step_versionTwo      = 695052

More or less the same. The compiler will do its optimization pass and realize that running is always false, and will thus, for all intents and purposes, remove the body of step_versionOne; so when you call it from the loop in main, it'll just call the function and return.

On the other hand, when optimizing the second function, the compiler will realize that ticks += mStepSize * static_cast<int>(running); will always generate the same result, i.e. 0, so it won't bother executing that either.

All in all, if I'm correct (and if not, please correct me; I'm pretty new to this), all you'll get when calling both functions from the main loop is their overhead.

P.S. Here's the result for the first case (running is randomly generated in every iteration) when compiled with optimization:

step_versionOne      = 18868782
step_versionTwo      = 18812315
step_two - step_one  = 56467
one fast than two by = 0.299261
