What kinds of programs benefit most from LTO?

Question

When using dhrystone to measure DMIPS, I found that LTO greatly affects the result. Dhrystone built with LTO scores nearly 4x higher than the build without LTO:

$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src

Without LTO

$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static  dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected?  YES
Microseconds for one run through Dhrystone:     0.2 
Dhrystones per Second:                       5234421.7 
VAX MIPS rating =   2979.181 


 Performance counter stats for './dhrystone':

         19,158.53 msec task-clock:u              #    0.969 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               547      page-faults:u             #   28.551 /sec                   
    81,470,643,102      cycles:u                  #    4.252 GHz                      (50.01%)
         3,046,747      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (50.02%)
    37,208,106,969      stalled-cycles-backend:u  #   45.67% backend cycles idle      (50.00%)
   319,848,969,156      instructions:u            #    3.93  insn per cycle         
                                                  #    0.12  stalled cycles per insn  (49.99%)
    49,311,879,609      branches:u                #    2.574 G/sec                    (49.98%)
           317,518      branch-misses:u           #    0.00% of all branches          (50.00%)

      19.762244278 seconds time elapsed

      19.118127000 seconds user
       0.004017000 seconds sys

With LTO

$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static  dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected?  YES
Microseconds for one run through Dhrystone:     0.1 
Dhrystones per Second:                      19539623.0 
VAX MIPS rating =  11121.015 


 Performance counter stats for './dhrystone':

          5,146.69 msec task-clock:u              #    0.908 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               553      page-faults:u             #  107.448 /sec                   
    21,453,263,692      cycles:u                  #    4.168 GHz                      (50.00%)
         1,574,543      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (50.03%)
    12,575,396,819      stalled-cycles-backend:u  #   58.62% backend cycles idle      (50.04%)
    89,186,371,586      instructions:u            #    4.16  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (50.00%)
     7,717,732,872      branches:u                #    1.500 G/sec                    (49.97%)
           353,303      branch-misses:u           #    0.00% of all branches          (49.96%)

       5.666446006 seconds time elapsed

       5.133037000 seconds user
       0.003322000 seconds sys

As you can see:

The LTO build reaches 19,539,623.0 Dhrystones per second; the non-LTO build reaches 5,234,421.7.
The LTO build executes 89,186,371,586 instructions; the non-LTO build executes 319,848,969,156.

I think the root cause is that LTO eliminates a large share of the executed instructions, so the program runs much faster.
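As a rough sanity check on that reading, the ratios between the two `perf stat` runs above can be computed directly. This is only a sketch: the constants are copied from the output above, and the function names are made up for illustration.

```c
/* Counter values copied from the two `perf stat` runs above. */
static const double INSN_NO_LTO   = 319848969156.0;  /* instructions:u, no LTO */
static const double INSN_LTO      =  89186371586.0;  /* instructions:u, LTO    */
static const double BRANCH_NO_LTO =  49311879609.0;  /* branches:u, no LTO     */
static const double BRANCH_LTO    =   7717732872.0;  /* branches:u, LTO        */
static const double DPS_NO_LTO    =      5234421.7;  /* Dhrystones/s, no LTO   */
static const double DPS_LTO       =     19539623.0;  /* Dhrystones/s, LTO      */

/* How many times fewer instructions the LTO build executes. */
double insn_ratio(void)   { return INSN_NO_LTO / INSN_LTO; }

/* How many times fewer branches the LTO build executes. */
double branch_ratio(void) { return BRANCH_NO_LTO / BRANCH_LTO; }

/* How many times higher the LTO build's Dhrystones-per-second score is. */
double score_ratio(void)  { return DPS_LTO / DPS_NO_LTO; }
```

These come out to roughly 3.6x fewer instructions, roughly 6.4x fewer branches, and a roughly 3.7x higher score. The branch count falling faster than the instruction count would be consistent with many call/return pairs disappearing, although the counters alone do not prove that.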

But when I run benchmarks like coremark/coremark-pro, LTO gives no notable improvement over the non-LTO build.

Questions

What kinds of programs are most affected by LTO? Why does LTO have a big impact on dhrystone but not on coremark/coremark-pro?
How does LTO reduce the number of instructions executed at run time?

Answer 1

LTO allows cross-file inlining. If you have tiny helper functions (like C++ get/set member functions) whose definitions aren't visible in a .h file, so they can't be inlined normally, LTO can greatly simplify code that calls such functions a lot.

A simple get or set wrapper can inline to zero instructions (with the object data just living in registers), whereas a real call would need to pass an arg in a register, not to mention executing the actual bl and ret instructions. The call also has to respect the calling convention, so the call site might need to mov some values into call-preserved registers. When inlining, the compiler has full control over all the registers.
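To make the get/set case concrete, here is a minimal sketch in C. The names `counter_t`, `counter_get`, `counter_add` and `sum_to` are hypothetical and not taken from dhrystone. It is shown as one listing so that it compiles on its own; the comments mark where the file boundaries would be in a real project. Built as separate translation units without `-flto`, each helper call in `sum_to` would typically remain a real `bl`/`ret` pair. With `-flto` (or with everything in one file, as here) the compiler is free to inline the helpers and keep `c.value` in a register.

```c
#include <stdint.h>

/* ---- would live in counter.h: declarations only, no bodies ---- */
typedef struct { uint64_t value; } counter_t;
uint64_t counter_get(const counter_t *c);
void counter_add(counter_t *c, uint64_t n);

/* ---- would live in counter.c: bodies hidden from other .c files without LTO ---- */
uint64_t counter_get(const counter_t *c) { return c->value; }
void counter_add(counter_t *c, uint64_t n) { c->value += n; }

/* ---- would live in main.c: the hot loop, which only sees the prototypes ---- */
uint64_t sum_to(uint64_t n)
{
    counter_t c = { 0 };
    for (uint64_t i = 1; i <= n; i++)
        counter_add(&c, i);      /* one call per iteration unless inlined */
    return counter_get(&c);      /* 1 + 2 + ... + n */
}
```

`sum_to(10)` returns 55 in either build; only the generated code differs. Comparing the disassembly of the two builds (for example with `objdump -d`) is one way to check whether the calls survived.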

For benchmarks, putting the work in a separate file from the repeat loop is a good way of stopping compilers from defeating the benchmark by optimizing across repeat-loop iterations (e.g. hoisting work out of the loop instead of re-computing something every time).

Unless you use LTO, in which case the compiler can break your benchmark after all. (Or maybe there's another reason with dhrystone, IDK.)
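A sketch of that benchmark pattern, again with hypothetical names (`work`, `run_benchmark`) and shown as one listing so it is self-contained. In the separate-file layout without LTO, the compiler building the loop only sees a prototype for `work`, cannot tell that it has no side effects, and so has to call it on every iteration. Once LTO (or a single-file build) exposes the body, the compiler may notice that the call is loop-invariant and do the work once, so the loop no longer measures what it was meant to.

```c
#include <stdint.h>

/* ---- would live in work.c ---- */
/* Pure function: the same input always gives the same result. */
uint32_t work(uint32_t x)
{
    uint32_t h = x;
    for (int i = 0; i < 16; i++)
        h = h * 2654435761u + 12345u;   /* arbitrary mixing step */
    return h;
}

/* ---- would live in bench.c, which only sees `uint32_t work(uint32_t);` ---- */
uint32_t run_benchmark(uint32_t seed, unsigned long iterations)
{
    uint32_t last = 0;
    for (unsigned long i = 0; i < iterations; i++)
        last = work(seed);   /* loop-invariant: same argument every time */
    return last;
}
```

The return value is the same either way (`run_benchmark(7, 1000)` equals `work(7)`), which is exactly why an optimizer that can see both functions is allowed to skip the repeated calls.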

Answer 1 accepted on 2022-10-08 18:36:27.
