Speed up an if-else ladder in C++

I have a piece of code that demands execution speed over anything else. Using high_resolution_clock from std::chrono, I found that this switch() with its if-else ladders takes over 70% of my execution time. Is there any way to speed it up?

I'm compiling with gcc at -O3.

I looked into a similar question, "If else ladder optimisation", but I can't use a return statement because it would exit the outer loop, which I must not do.

switch(RPL_OPTION) {
            case 0:
                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 1:
                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(RPL_CONST);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(RPL_CONST);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 2:
                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(upper_th);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(lower_th);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 3:
                //Generating a gaussian noise distribution with 0 mean and 1 std deviation
                default_random_engine generator(time(0));
                normal_distribution<float> dist(0,1);

                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Calling a random sample from the distribution and calculating a noise value
                        filtered_output.push_back(dist(generator)*sigma);
                        flag_output.push_back(1);
                        continue;

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Calling a random sample from the distribution and calculating a noise value
                        filtered_output.push_back(dist(generator)*sigma);
                        flag_output.push_back(1);
                        continue;

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
        }

A few optimizations that come to mind:

  1. vector.push_back() or emplace_back(), even with reserve(), are poison for performance because compilers are generally unable to vectorize a loop that contains them. We can work with plain C pointers instead, or just preallocate.

  2. Constructing the random engine and distribution in the last case may have significant cost if this code is called repeatedly. We can hoist this out of the code. Note that this also avoids a problem with the repeated initialization: time(0) is a low-resolution clock, so every call within the same second seeds the engine identically and produces the same sequence.

  3. This may be unnecessary, but rewriting the code a bit may allow more compiler optimizations, especially by turning branches into conditional-move instructions and reducing the number of branches.

/* TODO: We have better ways of initializing generators but that is
 * unrelated to its performance
 * I'm lazy and turn this into a static variable. Better use a
 * different pattern (like up in the stack somewhere)
 * but you get the idea
 */
static default_random_engine generator(time(0));
static normal_distribution<float> dist(0,1);

std::size_t output_pos = filtered_output.size();
filtered_output.resize(output_pos + WINDOW_SIZE);
flag_output.resize(output_pos + WINDOW_SIZE);

switch(RPL_OPTION) {
case 0:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        filtered_output[output_pos + k] = ansk;
        flag_output[output_pos + k] = flag;
    }
    break;
case 1:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        // written carefully to help compiler turning this into a CMOV
        auto filtered = flag ? RPL_CONST : ansk;
        filtered_output[output_pos + k] = filtered;
        flag_output[output_pos + k] = flag;
    }
    break;
case 2:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        auto filtered = ansk < lower_th ? lower_th : ansk;
        filtered = ansk >= upper_th ? upper_th : filtered;
        filtered_output[output_pos + k] = filtered;
        flag_output[output_pos + k] = flag;
    }
    break;
case 3:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        // optimized under the assumption that flag is usually 1
        auto ansk = ans[k];
        auto random = dist(generator) * sigma;
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        auto filtered = flag ? random : ansk;
        filtered_output[output_pos + k] = filtered;
        flag_output[output_pos + k] = flag;
    }
    break;
}

Analyzing compiler output

I checked the resulting code with Godbolt. Cases 0-2 do vectorize. However, a lot hinges on good alias detection, so this needs to be analyzed in the context of the full function containing this code. Particular pain points are:

  • Potential aliasing between ans and filtered_output. That is hard to avoid, but I think compilers should be able to emit code that checks for it at run time.
  • Potential aliasing between the thresholds + RPL_CONST and filtered_output. When in doubt, copy the inputs into local variables (which the compiler can prove to be alias-free). Just marking them const may not be enough.
  • Potential aliasing between flag_count and flag_output, depending on the types. Again, better use a local variable for the count, then copy it to its output if required.

As for case 3, computing a random sample is expensive enough that my optimization may degrade performance if the inputs are usually within limits. That needs benchmarking. The longer I think about it, the more likely it seems that losing a few clock cycles on a mispredict costs much less than computing a sample that is never used.
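If benchmarking confirms that doubt, case 3 can keep its branch so that a sample is drawn only for out-of-range inputs. The following is a minimal sketch of that variant, not a measured improvement. The function name replace_with_noise and its parameter list are assumptions made for illustration, and the output vectors are assumed to be preallocated to the size of ans.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not from the original post): case 3 with a branch,
// so that a Gaussian sample is only drawn for out-of-range inputs.
// Assumes both output vectors already hold at least ans.size() elements.
template <class Engine>
int replace_with_noise(const std::vector<float>& ans,
                       std::vector<float>& filtered_output,
                       std::vector<int>& flag_output,
                       float lower_th, float upper_th, float sigma,
                       Engine& generator,
                       std::normal_distribution<float>& dist)
{
    int flag_count = 0;  // local counter, copied out via the return value
    for (std::size_t k = 0; k < ans.size(); ++k) {
        const float ansk = ans[k];
        const int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        float filtered = ansk;
        if (flag) {
            // The expensive sample is computed only when it is used.
            filtered = dist(generator) * sigma;
        }
        filtered_output[k] = filtered;
        flag_output[k] = flag;
    }
    return flag_count;
}
```

The engine and the distribution are passed in by reference, so they can live outside the hot loop as suggested in point 2 above. Which variant is faster depends on how often the inputs fall outside the thresholds.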

Removing redundant code

The resulting code is highly redundant. We could move the switch-case into the loop, but that interferes with the vectorization. Instead, we can use a template function pattern.


#include <cstddef>
#include <random>
#include <vector>

class Filter
{
    int WINDOW_SIZE;
    float upper_th, lower_th, sigma, RPL_CONST;
    std::default_random_engine generator;
    std::normal_distribution<float> dist;

    template<class FilterOp>
    int apply(std::vector<float>& filtered_output,
              std::vector<int>& flag_output,
              const std::vector<float>& ans, FilterOp filter)
    {
        // move stuff into local variables to help with alias detection
        const int WINDOW_SIZE = this->WINDOW_SIZE;
        const float upper_th = this->upper_th, lower_th = this->lower_th;
        const std::size_t output_pos = filtered_output.size() - WINDOW_SIZE;
        int flag_count = 0;
        for(int k = 0; k < WINDOW_SIZE; k++) {
            auto ansk = ans[k];
            int flag = (ansk >= upper_th) | (ansk < lower_th);
            flag_count += flag;
            filtered_output[output_pos + k] = filter(ansk, flag);
            flag_output[output_pos + k] = flag;
        }
        return flag_count;
    }
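public:
    // Constructor added for illustration; the parameter names and the
    // explicit seed are assumptions, not part of the original sketch.
    Filter(int window_size, float upper, float lower, float sigma,
           float rpl_const, unsigned seed)
        : WINDOW_SIZE(window_size), upper_th(upper), lower_th(lower),
          sigma(sigma), RPL_CONST(rpl_const), generator(seed), dist(0.0f, 1.0f)
    {}
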
    int operator()(int RPL_OPTION,
              std::vector<float>& filtered_output,
              std::vector<int>& flag_output,
              const std::vector<float>& ans)
    {
        std::size_t output_pos = filtered_output.size();
        filtered_output.resize(output_pos + WINDOW_SIZE);
        flag_output.resize(output_pos + WINDOW_SIZE);
        switch(RPL_OPTION) {
        case 0:
            return apply(filtered_output, flag_output, ans,
                [](float ansk, int flag) noexcept -> float {
                    return ansk;
            });
        case 1:
            return apply(filtered_output, flag_output, ans,
                [RPL_CONST=this->RPL_CONST](float ansk, int flag) noexcept -> float {
                    return flag ? RPL_CONST : ansk;
            });
        case 2:
            return apply(filtered_output, flag_output, ans,
                [lower_th=this->lower_th, upper_th=this->upper_th](
                      float ansk, int flag) noexcept -> float {
                    auto filtered = ansk < lower_th ? lower_th : ansk;
                    return ansk >= upper_th ? upper_th : filtered;
            });
        case 3:
            return apply(filtered_output, flag_output, ans,
                [this](float ansk, int flag) noexcept -> float {
                    return flag ? dist(generator)*sigma : ansk;
            });
        default: return 0;
        }
    }
};

I am about 98% sure that the if-else ladder is not the problem.

For me, the main candidate for optimization is the push_back calls on std::vector (or whatever container you use), with their many reallocations and data copies.

Please use the reserve function to allocate the needed memory beforehand.
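As a minimal sketch of that advice, assuming the number of windows is known before the outer loop starts: the helper name reserve_outputs and its parameters are hypothetical and do not come from the question.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical setup step, called once before the outer loop.
// After this, pushing num_windows * window_size further elements into
// either vector does not trigger a reallocation.
void reserve_outputs(std::vector<float>& filtered_output,
                     std::vector<int>& flag_output,
                     std::size_t num_windows, std::size_t window_size)
{
    const std::size_t total = num_windows * window_size;
    filtered_output.reserve(filtered_output.size() + total);
    flag_output.reserve(flag_output.size() + total);
}
```

Reserving once for the whole run is deliberate. Calling reserve for exactly one extra window on every iteration can defeat the vector's geometric growth and cause a reallocation per window.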

Then move all loop-invariant work out of the hot path, such as

default_random_engine generator(time(0));
normal_distribution<float> dist(0,1);

But without more example code, it is hard to judge.

A profiler will give you better results. Timer functions will not help much here.

"I have a piece of code that demands execution speed over anything else"

That suggests a non-obvious approach. The code pattern looks familiar from a signal processing viewpoint, so WINDOW_SIZE is likely non-trivial. In that case, using AVX2 with packed comparisons makes sense.

In short, you load a whole AVX2 register full of inputs, use two more AVX2 registers to hold broadcast copies of the lower and upper thresholds, and issue the two comparisons. This gives you two outputs, where each value is either 0 or ~0.

Hence, your flags can be determined by OR-ing the two registers. It's tempting to count the flags right away, but that requires a slow "horizontal add". Better to keep a running count in another AVX register, and do one horizontal add at the end.

The updates to filtered_output depend on the case, but for 1 and 2 you can use AVX for this as well. Choosing between two values based on the bits in a third register can be done with _mm256_blendv_epi8. You can safely ignore the 8 there; that's the minimum resolution (one byte). If you're doing 32-bit comparisons, your result register will also contain 32-bit outcomes, so _mm256_blendv_epi8 will work with 4*8 bits resolution.

Case 0 should of course just be a straight copy to filtered_output, outside the if statements.
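The packed-comparison approach can be sketched for case 1 roughly as follows. This is an illustration under several assumptions: the data are 32-bit float, int is 32 bits wide, the input and output buffers do not overlap, and the helper name flag_and_replace is made up. Because the data are floats, the sketch uses the float blend _mm256_blendv_ps rather than _mm256_blendv_epi8. The vector path is compiled only when the compiler targets AVX2 (for example with -mavx2); otherwise the scalar loop does all the work.

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper for case 1: flags values outside [lower_th, upper_th),
// replaces them with rpl_const and returns the number of flagged values.
int flag_and_replace(const float* ans, float* filtered, int* flags,
                     std::size_t n, float lower_th, float upper_th,
                     float rpl_const)
{
    std::size_t k = 0;
    int flag_count = 0;
#if defined(__AVX2__)
    const __m256 upper = _mm256_set1_ps(upper_th);  // broadcast thresholds
    const __m256 lower = _mm256_set1_ps(lower_th);
    const __m256 repl  = _mm256_set1_ps(rpl_const);
    __m256i counts = _mm256_setzero_si256();        // eight per-lane counters
    for (; k + 8 <= n; k += 8) {
        const __m256 v = _mm256_loadu_ps(ans + k);
        // Each lane becomes all-ones where the comparison holds, else zero.
        const __m256 ge = _mm256_cmp_ps(v, upper, _CMP_GE_OQ);
        const __m256 lt = _mm256_cmp_ps(v, lower, _CMP_LT_OQ);
        const __m256 mask = _mm256_or_ps(ge, lt);
        // Take repl where the mask is set and the input elsewhere.
        _mm256_storeu_ps(filtered + k, _mm256_blendv_ps(v, repl, mask));
        const __m256i imask = _mm256_castps_si256(mask);
        // All-ones is -1 as a 32-bit integer, so subtracting it adds 1.
        counts = _mm256_sub_epi32(counts, imask);
        // Shift the top bit down to turn the mask into 0/1 flags.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(flags + k),
                            _mm256_srli_epi32(imask, 31));
    }
    // A single horizontal add after the loop.
    alignas(32) int lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), counts);
    for (int lane : lanes) flag_count += lane;
#endif
    for (; k < n; ++k) {  // scalar tail, or the whole input without AVX2
        const float v = ans[k];
        const int flag = (v >= upper_th) | (v < lower_th);
        flag_count += flag;
        filtered[k] = flag ? rpl_const : v;
        flags[k] = flag;
    }
    return flag_count;
}
```

Whether hand-written intrinsics beat the compiler's own vectorization of the scalar loop is something to measure rather than assume.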

The first thing to notice is that you push_back on vectors. The code shows no call to reserve, so the vectors will reallocate over and over as they grow. That's probably more expensive than anything else in the loop.

The next point concerns the if-else ladder:

Have you actually profiled whether the ladder is a problem at all? Branches are only expensive when they are mispredicted. Maybe the branch predictor works just fine on your input? Assuming this switch statement is executed many times, one way to help it would be to sort the input. Then the if-else ladder wouldn't jump randomly on every element but would take the same path many times before switching to a new one. But that only helps if the loop runs many times; otherwise the cost of sorting will negate any improvement.

And if the switch is repeated many times, you can split the input into 3 groups once, one for each of the 3 branches in the ladder, and process each group without any if-else ladder.

Looking closer at the code, I see that the first 2 branches of the if-else ladder are identical (except in case 2). So you can combine the tests, making this a simple if-else:

if ((ans[k] >= upper_th) || (ans[k] < lower_th))

Because || short-circuits, this will produce the same code as before. But you can do better:

if ((ans[k] >= upper_th) | (ans[k] < lower_th))

Now both parts get evaluated, since | does not short-circuit. Except that compilers are artificial stupids and might emit short-circuit code anyway. At this point you are fighting the optimizer. Some compilers will turn this into 2 branches, some leave it at one.

You can use something like the following trick there:

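// Needs #include <functional>. Index 0 is taken when the value is in range,
// index 1 when it is out of range.
const std::function<void()> fn[] = {
    [&]() { /* code for first choice */ },
    [&]() { /* code for second choice */ },
};
fn[(ans[k] >= upper_th) | (ans[k] < lower_th)]();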

By turning the condition of your if into the computation of an array index, the compiler optimization that produces 2 branches is circumvented. Hopefully. At least until the next compiler update. :)

When fighting the optimizer, you have to recheck your solution every time you update the compiler.

And for case 2, the only difference in the code is which value to push_back. On most architectures that can be turned into a conditional move instead of a branch if you use

(ans[k] >= upper_th) ? upper_th : lower_th;

for the push_back.

Firstly, sorting ans could be a good idea because of branch prediction, but whether it pays off depends mostly on the size of ans. Secondly, if you are using C++20, you can take a look at the [[likely]] and [[unlikely]] attributes. If you know which branch is taken most of the time (or hardly ever), you can easily apply them.
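A minimal sketch of the attribute syntax, assuming that out-of-range samples are rare in the actual data. The function count_outliers is a hypothetical reduction of the loop in the question that only counts flags.

```cpp
#include <vector>

// Hypothetical helper, assuming out-of-range samples are the rare case.
// [[unlikely]] is a C++20 attribute and only a hint; the compiler is free
// to ignore it.
int count_outliers(const std::vector<float>& ans,
                   float lower_th, float upper_th)
{
    int flag_count = 0;
    for (float v : ans) {
        if ((v >= upper_th) | (v < lower_th)) [[unlikely]] {
            ++flag_count;
        }
    }
    return flag_count;
}
```

If flagged samples dominate instead, [[likely]] goes in the same position. The hint affects code layout at most, so its effect should be checked with a benchmark.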
