
C: Is this branchless hack actually faster?

I'm trying to clamp a value between -127 and 127 on a Cortex-M based microcontroller.

I have two competing functions: one uses conditionals, the other uses a branchless hack I found here.

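// Using conditional statements (named clamp1 to match the assembly label below)
int clamp1(int val) { return ((val > 127) ? 127 : (val < -127) ? -127 : val); }

// Using branchless hacks (named clamp2 to match the assembly label below)
// Assumes a 32-bit int and an arithmetic right shift of negative values
int clamp2(int val) {
    val -= -127;          // move the lower bound to 0
    val &= (~val) >> 31;  // zero val if it is negative (below the lower bound)
    val += -127;          // move back: val is now max(val, -127)
    val -= 127;           // move the upper bound to 0
    val &= val >> 31;     // zero val if it is non-negative (at or above the upper bound)
    val += 127;           // move back: val is now min(val, 127)

    return val;
}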
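
Before comparing speed, it may be worth confirming that the two versions agree. The following is a minimal sketch, not part of the original question: the names `clamp_cond`, `clamp_mask` and `count_mismatches` are made up here so that both functions can live in one file, and the bit-twiddling version is assumed to run on a platform with 32-bit `int` and an arithmetic right shift.

```c
/* Conditional version from the question (hypothetical name for this sketch). */
static int clamp_cond(int val) {
    return (val > 127) ? 127 : (val < -127) ? -127 : val;
}

/* Bit-twiddling version from the question (hypothetical name for this sketch).
   Assumes 32-bit int and an arithmetic right shift of negative values. */
static int clamp_mask(int val) {
    val -= -127;
    val &= (~val) >> 31;
    val += -127;
    val -= 127;
    val &= val >> 31;
    val += 127;
    return val;
}

/* Counts the inputs in [lo, hi] for which the two versions disagree.
   hi must be below INT_MAX - 127 so that clamp_mask never overflows. */
static long count_mismatches(int lo, int hi) {
    long mismatches = 0;
    for (int v = lo; v <= hi; ++v) {
        if (clamp_cond(v) != clamp_mask(v)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Under those assumptions, a call such as `count_mismatches(-100000, 100000)` would be expected to return 0. Inputs near `INT_MAX` are deliberately left out because the first addition in `clamp_mask` would overflow there.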

Now, I know that in some cases one of these methods might be faster than the other, and vice versa. But in general, is it worth using the branchless technique? It doesn't really matter to me which one I use; both will work just fine in my case.

A little background on the microcontroller: it's an ARM-based part running at 90 MIPS with a 3-stage pipeline (fetch, decode and execute). It seems to have some sort of branch predictor, but I couldn't dig up details.

ARM code (GCC 4.6.3 with -O3):

clamp1:
    mvn r3, #126
    cmp r0, r3
    movlt   r0, r3
    cmp r0, #127
    movge   r0, #127
    bx  lr

clamp2:
    add r0, r0, #127
    mvn r3, r0
    and r0, r0, r3, asr #31
    sub r0, r0, #254
    and r0, r0, r0, asr #31
    add r0, r0, #127
    bx  lr

Thumb code:

clamp1:
    mvn r3, #126
    cmp r0, r3
    it  lt
    movlt   r0, r3
    cmp r0, #127
    it  ge
    movge   r0, #127
    bx  lr

clamp2:
    adds    r0, r0, #127
    mvns    r3, r0
    and r0, r0, r3, asr #31
    subs    r0, r0, #254
    and r0, r0, r0, asr #31
    adds    r0, r0, #127
    bx  lr

Both are branchless thanks to ARM's conditional execution design. I will bet you they are essentially comparable in performance.
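
One way to check a bet like this is simply to time both versions. The sketch below is an assumption-laden illustration rather than anything from the thread: `clamp_cond`, `clamp_mask` and `time_clamp` are names invented here, and it uses the standard `clock()` function, so it measures a hosted build. On the actual microcontroller, a hardware cycle counter (where the part provides one) would give more meaningful numbers.

```c
#include <time.h>

/* Conditional version from the question (hypothetical name for this sketch). */
static int clamp_cond(int val) {
    return (val > 127) ? 127 : (val < -127) ? -127 : val;
}

/* Bit-twiddling version from the question (hypothetical name for this sketch).
   Assumes 32-bit int and an arithmetic right shift of negative values. */
static int clamp_mask(int val) {
    val -= -127;
    val &= (~val) >> 31;
    val += -127;
    val -= 127;
    val &= val >> 31;
    val += 127;
    return val;
}

/* Calls fn on a repeating ramp of inputs in [-512, 511] and returns the
   elapsed CPU time in seconds. The running sum is kept in a volatile
   variable and written to *checksum so the calls cannot be optimised away. */
static double time_clamp(int (*fn)(int), long iterations, long *checksum) {
    volatile long sink = 0;
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        int input = (int)(i % 1024) - 512;
        sink += fn(input);
    }
    clock_t stop = clock();
    *checksum = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_clamp(clamp_cond, n, &sum)` and `time_clamp(clamp_mask, n, &sum)` with the same `n` gives two times to compare, and the two checksums should match. The function-pointer call adds overhead of its own, so small differences between the two results should not be over-interpreted.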

Something to realize is that the ARM and x86 architectures are very different when it comes to branch instructions. Taking a jump clears the pipeline, which can result in the expenditure of a number of clock cycles just to 'get back to where you were' in terms of throughput.

To quote a PDF I downloaded the other day (p. 14 of http://simplemachines.it/doc/arm_inst.pdf):

Conditional Execution

  • Most instruction sets only allow branches to be executed conditionally.
  • However by reusing the condition evaluation hardware, ARM effectively increases number of instructions.
  • All instructions contain a condition field which determines whether the CPU will execute them.
  • Non-executed instructions soak up 1 cycle.
      – Still have to complete cycle so as to allow fetching and decoding of following instructions.
  • This removes the need for many branches, which stall the pipeline (3 cycles to refill).
  • Allows very dense in-line code, without branches.
  • The Time penalty of not executing several conditional instructions is frequently less than overhead of the branch or subroutine call that would otherwise be needed.

No. The C language doesn't have speed; that's a concept introduced by implementations of C. A perfectly optimal compiler would translate both of those to the same machine code.

C compilers are more likely to be able to optimise code that conforms to common styles and is well defined. The second function isn't well defined.

Those additions and subtractions could cause integer overflows. Signed integer overflow is undefined behaviour, so it could cause your program to malfunction. Optimistically, your hardware might implement wrapping or saturation. Slightly less optimistically, your OS or compiler might implement signals or trap representations for integer overflows. Detecting integer overflows might affect the perceived performance of modifying a variable. The worst case is that your program loses its integrity.
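
To make the overflow concern concrete: in the second function only the first step, which adds 127 to the input, can exceed the range of `int`; every later step works on a value that has already been pulled back towards the bounds. A small sketch of that condition follows. The helper name `clamp_mask_overflows` is made up for illustration and does not come from the question.

```c
#include <limits.h>
#include <stdbool.h>

/* True when adding 127 to val (the first step of the bit-twiddling clamp)
   would exceed INT_MAX, which is signed overflow and therefore undefined.
   The comparison itself is written so that it cannot overflow. */
static bool clamp_mask_overflows(int val) {
    return val > INT_MAX - 127;
}
```

So the hack is only at risk for the top 127 values of `int`. Whether that matters depends on whether such inputs can ever reach the function; guarding against them with an `if` would reintroduce the conditional the hack was meant to avoid.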

The & and >> operators have implementation-defined aspects for signed types. They may result in a negative zero, which is an example of a trap representation. Using a trap representation is undefined behaviour, so your program could lose its integrity.
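
If the hack is used anyway, its implementation-defined assumptions can at least be checked at build time. The following is a sketch assuming a C11 compiler (for `_Static_assert`); the helper `sign_mask` is a hypothetical name added only to show what the shift is expected to produce.

```c
#include <limits.h>

/* Fail the build unless int is exactly 32 bits wide, which the shift
   count of 31 in the bit-twiddling clamp relies on. */
_Static_assert(sizeof(int) * CHAR_BIT == 32,
               "clamp hack assumes a 32-bit int");

/* Fail the build unless right-shifting a negative int replicates the sign
   bit (arithmetic shift). The C standard leaves this result to the
   implementation. */
_Static_assert((-1 >> 31) == -1,
               "clamp hack assumes an arithmetic right shift");

/* With both assumptions met: all ones (-1) for negative v, 0 otherwise. */
static int sign_mask(int v) {
    return v >> 31;
}
```

A compiler that does not meet either assumption rejects the file at build time, so the function cannot silently return wrong results at run time.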

Perhaps your OS or compiler implements parity bit checks for int objects. In this case, try to imagine recalculating the parity bits every time a variable changes and verifying the parity bits every time a variable is read. If a parity check fails, your program could lose its integrity.

Use the first function. At least it's well defined. If your program appears to be running slowly, optimising this code probably won't speed it up significantly; use a profiler to find more significant optimisations, use a more optimal OS or compiler, or buy faster hardware.
