What optimizations does __builtin_unreachable facilitate?

Judging from GCC's documentation:

"If control flow reaches the point of the __builtin_unreachable, the program is undefined."

I thought __builtin_unreachable could be used as a hint to the optimizer in all sorts of creative ways, so I did a little experiment:

#include <utility>  // std::swap

void stdswap(int& x, int& y)
{
    std::swap(x, y);
}

void brswap(int& x, int& y)
{
    if(&x == &y)
        __builtin_unreachable();
    x ^= y;
    y ^= x;
    x ^= y;
}

void rswap(int& __restrict x, int& __restrict y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

gets compiled to (g++ -O2):

stdswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        mov     edx, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], edx
        mov     DWORD PTR [rsi], eax
        ret
brswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], eax
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rsi], eax
        xor     DWORD PTR [rdi], eax
        ret
rswap(int&, int&):
        mov     eax, DWORD PTR [rsi]
        mov     edx, DWORD PTR [rdi]
        mov     DWORD PTR [rdi], eax
        mov     DWORD PTR [rsi], edx
        ret

I assume that stdswap and rswap are optimal from the optimizer's perspective. Why doesn't brswap get compiled to the same thing? Can I get it to compile to the same thing with __builtin_unreachable?

The purpose of __builtin_unreachable is to help the compiler remove dead code (code the programmer knows will never be executed) and to linearize the code by letting the compiler know that a path is "cold". Consider the following:

#include <cstdio>  // std::puts

void exit_if_true(bool x);

int foo1(bool x)
{
    if (x) {
        exit_if_true(true);
        //__builtin_unreachable(); // we do not enable it here
    } else {
        std::puts("reachable");
    }

    return 0;
}
int foo2(bool x)
{
    if (x) {
        exit_if_true(true);
        __builtin_unreachable();  // now compiler knows exit_if_true
                                  // will not return as we are passing true to it
    } else {
        std::puts("reachable");
    }

    return 0;
}

Generated code:

foo1(bool):
        sub     rsp, 8
        test    dil, dil
        je      .L2              ; that jump is going to change
        mov     edi, 1
        call    exit_if_true(bool)
        xor     eax, eax         ; that tail is going to be removed
        add     rsp, 8
        ret
.L2:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
foo2(bool):
        sub     rsp, 8
        test    dil, dil
        jne     .L9              ; changed jump
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
.L9:
        mov     edi, 1
        call    exit_if_true(bool)

Notice the differences:

  • xor eax, eax and ret were removed, as the compiler now knows that they are dead code.
  • The compiler swapped the order of the branches: the branch with the puts call now comes first, so the conditional jump on the likely path is a forward branch that is not taken. Such branches are faster both when predicted and when there is no prediction information.

The assumption here is that a branch ending in a noreturn function call or __builtin_unreachable will either be executed only once or lead to a longjmp call or an exception throw, both of which are rare and do not need to be prioritized during optimization.
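When you control the declaration of the function that never returns, the same information can be attached to the function itself rather than to each call site. The following is a minimal sketch under that assumption; fail and checked_div are hypothetical names used only for illustration, and throwing std::runtime_error is just one way to never return.

```cpp
#include <stdexcept>

// Hypothetical helper: never returns normally, so it is marked [[noreturn]].
[[noreturn]] void fail(const char* msg)
{
    throw std::runtime_error(msg);
}

// Hypothetical caller: the branch that ends in fail() can be treated as cold,
// and no code is needed after the call on that path.
int checked_div(int a, int b)
{
    if (b == 0)
        fail("division by zero");
    return a / b;
}
```

Unlike __builtin_unreachable placed after a call, the attribute makes a promise only about fail itself, which the compiler can check against the function body.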

You are trying to use it for a different purpose: giving the compiler information about aliasing (you could try doing the same for alignment). Unfortunately, GCC doesn't understand such address checks.

As you have noticed, adding __restrict__ helps. So __restrict__ works for aliasing; __builtin_unreachable does not.
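To see why the aliasing information matters for this particular function, note that the XOR sequence and a load/store swap only agree when the two references name different objects. The sketch below illustrates this; load_store_swap and the two self_* helpers are hypothetical names introduced for the demonstration.

```cpp
// XOR swap as in brswap/rswap: correct only for two distinct objects.
void xor_swap(int& x, int& y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

// Load/store swap: the shape of the code emitted for stdswap and rswap.
void load_store_swap(int& x, int& y)
{
    int tmp = x;
    x = y;
    y = tmp;
}

// Hypothetical helpers that pass the same object for both parameters.
int self_xor_swap(int v)        { xor_swap(v, v);        return v; }  // x ^= x clears the value
int self_load_store_swap(int v) { load_store_swap(v, v); return v; }  // value is preserved
```

Without proof that x and y do not alias, the compiler has to preserve the zeroing behaviour of the XOR version, so it cannot rewrite it as two loads and two stores.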

Look at the following example, which compares an address check against __builtin_assume_aligned:

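#include <cstdint>  // std::uintptr_t

void copy1(int *__restrict__ dst, const int *__restrict__ src)
{
    // Declare the misaligned case unreachable, i.e. claim 16-byte alignment.
    if (reinterpret_cast<std::uintptr_t>(dst) % 16 != 0) __builtin_unreachable();
    if (reinterpret_cast<std::uintptr_t>(src) % 16 != 0) __builtin_unreachable();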

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

void copy2(int *__restrict__ dst, const int *__restrict__ src)
{
    dst = static_cast<int *>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int *>(__builtin_assume_aligned(src, 16));

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

Generated code:

copy1(int*, int const*):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movups  XMMWORD PTR [rdi], xmm0
        ret
copy2(int*, int const*):
        movdqa  xmm0, XMMWORD PTR [rsi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret

You might expect the compiler to work out from the check on dst % 16 that the pointer is 16-byte aligned, but it doesn't. So unaligned loads and stores are used in the first version, while the second version generates the faster instructions that require the address to be aligned.
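Since __builtin_assume_aligned is a promise rather than a check, the caller has to guarantee the alignment; passing a misaligned pointer is undefined behaviour. A minimal sketch of a call site that satisfies the promise, assuming buffers declared with alignas(16) (copy4_aligned and is_aligned16 are hypothetical names):

```cpp
#include <cstdint>

// Same idea as copy2: promise 16-byte alignment for both pointers.
void copy4_aligned(int* __restrict__ dst, const int* __restrict__ src)
{
    dst = static_cast<int*>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int*>(__builtin_assume_aligned(src, 16));

    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
}

// Hypothetical helper to verify the promise at the call site.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

A caller would typically declare its buffers as alignas(16) int buf[4]; and, in debug builds, check is_aligned16(buf) before calling copy4_aligned.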

I think you are trying to micro-optimize your code in the wrong direction.

__builtin_unreachable, like __builtin_expect, does what is expected: in your case it removes the unnecessary cmp and jnz of the unused if statement.

The compiler should generate machine code from the C code you have written so that the program behaves predictably. During optimization it is able to find and optimize (i.e. replace with a better machine-code version) some patterns when they are known to the optimization algorithm; such an optimization must not break the program's behavior.

E.g. something like

char a[100];
for(int i=0; i < 100; i++)
   a[i]  = 0;

will typically be replaced with a single call to the library function std::memset(a, 0, 100), which is implemented in assembly and tuned for the current CPU architecture.
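The two forms are interchangeable because they have the same observable effect, which is what allows the optimizer to substitute one for the other. A small sketch of that equivalence, with zero_loop and zero_memset as hypothetical names:

```cpp
#include <cstddef>
#include <cstring>

// The hand-written loop from the text, generalized to n bytes.
void zero_loop(char* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0;
}

// The library call the optimizer may substitute for the loop.
void zero_memset(char* a, std::size_t n)
{
    std::memset(a, 0, n);
}
```

Whether the substitution actually happens depends on the compiler, its version and the optimization level.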

Likewise, the compiler is able to detect

x ^= y;
y ^= x;
x ^= y;

and replace it with the simplest machine code.

I think your if statement and the unreachable directive interfered with the optimizer, so that it could not perform this optimization.

When swapping two integers, the compiler can remove the third, temporary variable by itself, i.e. the result is going to be something like

movl    $2, %ebx
movl    $1, %eax
xchg    %eax,%ebx  

where the values in the ebx and eax registers are actually your x and y. You can implement it yourself like this:

void swap_x86(int& x, int& y)
{
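    // x and y are 32-bit ints, so use the 32-bit registers; "+a"/"+b" mark
    // eax/ebx as read-write operands bound to x and y.
    __asm__ __volatile__("xchg %%eax, %%ebx" : "+a"(x), "+b"(y));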
}
...
int a = 1;
int b = 2;
swap_x86(a,b);

When to use __builtin_unreachable? Probably when you know that some situation is practically impossible, but logically it may happen. I.e. you have a function like

void foo(int v) {

    switch( v ) {
        case 0:
            break;
        case 1:
            break;
        case 2:
            break;
        case 3:
            break;
        default:
            __builtin_unreachable();
    }
}

And you know that the value of the v argument is always between 0 and 3. However, the range of int is -2147483648 to 2147483647 (when int is a 32-bit type). The compiler has no idea about the real range of values and cannot remove the default block (or some of the cmp instructions, etc.), but it will warn you if you don't add this block to the switch. So in this case __builtin_unreachable may help.
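A variant of the same switch that returns a value makes the effect easier to observe. This is a sketch only; weight and the table values are hypothetical, and calling it with a v outside 0..3 is undefined behaviour.

```cpp
// Hypothetical lookup: the caller guarantees that v is in [0, 3].
int weight(int v)
{
    switch (v) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 40;
        case 3: return 80;
        default:
            __builtin_unreachable();  // promise to the compiler: never taken
    }
}
```

With the promise in place, the compiler is free to omit the range check in front of the jump table or lookup table it generates for the four cases.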

"Probably when you know that some situation is practically impossible, but logically it may happen."

Be very, very careful about using __builtin_unreachable() in default: case labels. I spent days trying to figure out random branches into .rodata areas. In my case, there were case labels 0 through 25 and a default: case, for an enum ranging from 0 to 128, and a switch value of 71 caused the problem. Version 6.3.0 had range-checking code, but 8.3.0 optimized it away.
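One way to reduce this risk is to check the "impossible" case in debug builds and only turn it into an optimizer hint in release builds. The following is a sketch of that pattern, not something taken from the answers above; UNREACHABLE_CHECKED, Op and apply are hypothetical names.

```cpp
#include <cassert>

// Debug builds: the assert fires first and aborts with a message.
// Release builds (NDEBUG defined): the assert disappears and only the hint remains.
#define UNREACHABLE_CHECKED()                                   \
    do {                                                        \
        assert(false && "reached code marked as unreachable");  \
        __builtin_unreachable();                                \
    } while (0)

enum class Op { Add, Sub, Mul };

int apply(Op op, int a, int b)
{
    switch (op) {
        case Op::Add: return a + b;
        case Op::Sub: return a - b;
        case Op::Mul: return a * b;
    }
    UNREACHABLE_CHECKED();  // an out-of-range Op value is caught in debug builds
}
```

Running the test suite with assertions enabled would then report a stray value such as 71 as a failed assertion instead of a jump to an arbitrary address.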
