What kind of program can benefit much from LTO?
When using dhrystone to get DMIPS, I found that LTO greatly impacted the results. The LTO build of dhrystone scores nearly 4x the LTO-less build:
$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.2
Dhrystones per Second: 5234421.7
VAX MIPS rating = 2979.181
Performance counter stats for './dhrystone':
19,158.53 msec task-clock:u # 0.969 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
547 page-faults:u # 28.551 /sec
81,470,643,102 cycles:u # 4.252 GHz (50.01%)
3,046,747 stalled-cycles-frontend:u # 0.00% frontend cycles idle (50.02%)
37,208,106,969 stalled-cycles-backend:u # 45.67% backend cycles idle (50.00%)
319,848,969,156 instructions:u # 3.93 insn per cycle
# 0.12 stalled cycles per insn (49.99%)
49,311,879,609 branches:u # 2.574 G/sec (49.98%)
317,518 branch-misses:u # 0.00% of all branches (50.00%)
19.762244278 seconds time elapsed
19.118127000 seconds user
0.004017000 seconds sys
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 19539623.0
VAX MIPS rating = 11121.015
Performance counter stats for './dhrystone':
5,146.69 msec task-clock:u # 0.908 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
553 page-faults:u # 107.448 /sec
21,453,263,692 cycles:u # 4.168 GHz (50.00%)
1,574,543 stalled-cycles-frontend:u # 0.01% frontend cycles idle (50.03%)
12,575,396,819 stalled-cycles-backend:u # 58.62% backend cycles idle (50.04%)
89,186,371,586 instructions:u # 4.16 insn per cycle
# 0.14 stalled cycles per insn (50.00%)
7,717,732,872 branches:u # 1.500 G/sec (49.97%)
353,303 branch-misses:u # 0.00% of all branches (49.96%)
5.666446006 seconds time elapsed
5.133037000 seconds user
0.003322000 seconds sys
As you can see, the LTO dhrystone result is 19,539,623.0 Dhrystones per second and the LTO-less dhrystone result is 5,234,421.7. Also, LTO dhrystone executes 89,186,371,586 instructions and LTO-less dhrystone executes 319,848,969,156.
I think the root cause is that LTO removes a large share of the executed instructions, so the program runs much faster.
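Working the ratios out from the two `perf stat` runs above (non-LTO value divided by LTO value, except for the score, which is LTO divided by non-LTO) gives a rough check of that guess. The numbers are taken directly from the output; the rounding is mine.

```latex
\begin{align*}
\text{score ratio}       &= \frac{19{,}539{,}623.0}{5{,}234{,}421.7} \approx 3.73 \\
\text{instruction ratio} &= \frac{319{,}848{,}969{,}156}{89{,}186{,}371{,}586} \approx 3.59 \\
\text{cycle ratio}       &= \frac{81{,}470{,}643{,}102}{21{,}453{,}263{,}692} \approx 3.80 \\
\text{branch ratio}      &= \frac{49{,}311{,}879{,}609}{7{,}717{,}732{,}872} \approx 6.39
\end{align*}
```

Instructions per cycle barely moves (3.93 vs 4.16), so most of the cycle reduction tracks the drop in instruction count. The branch count falls by more than the instruction count, which is at least consistent with a lot of call and return instructions disappearing.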
But when I run benchmarks like coremark/coremark-pro, LTO gives no notable improvement over non-LTO.
LTO allows cross-file inlining, so if you have tiny helper functions (like C++ get/set functions in classes) that aren't visible in a .h for inlining normally, LTO can greatly simplify code that does a lot of calling such functions.
A simple get or set wrapper can inline to zero instructions (with the object data just living in registers), but a call/ret would need to pass an arg in a register, not to mention executing the actual bl and ret instructions. The call would also have to respect the calling convention, so the call-site might need to mov some values to call-preserved registers. But when inlining, the compiler has full control over all the registers.
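To make the accessor case concrete, here is a minimal sketch in C. The names (`struct point`, `point_get_x`, `point_set_x`, `sum_steps`) and the three-file split are hypothetical, not taken from dhrystone. It is shown as one listing so it compiles as-is; to see the effect, the three marked parts would have to go into separate files, built once with and once without `-flto`.

```c
#include <stdint.h>

/* --- point.h (hypothetical): callers see only declarations --- */
struct point { int32_t x, y; };
int32_t point_get_x(const struct point *p);
void point_set_x(struct point *p, int32_t x);

/* --- point.c (hypothetical): definitions, invisible to other files without LTO --- */
int32_t point_get_x(const struct point *p) { return p->x; }
void point_set_x(struct point *p, int32_t x) { p->x = x; }

/* --- main.c (hypothetical): hot loop built from tiny accessor calls --- */
int64_t sum_steps(int32_t n)
{
    struct point p = { 0, 0 };
    int64_t total = 0;
    for (int32_t i = 0; i < n; i++) {
        /* Built as separate files without LTO, each iteration makes three
         * real calls, and p must live in memory because its address
         * escapes to functions the compiler cannot see. */
        point_set_x(&p, point_get_x(&p) + 1);
        total += point_get_x(&p);
    }
    return total;
}
```

With the definitions visible (same file, or separate files plus `-flto`), the compiler is free to inline the accessors and keep `p.x` in a register, so the loop body can shrink to a couple of adds. Comparing the output of `objdump -d` for both builds is one way to check what your compiler actually did.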
For benchmarks, putting the work in a separate file from the repeat loop is a good way of stopping compilers from defeating the benchmark by optimizing across repeat-loop iterations (e.g. hoisting work out of the loop instead of re-computing something every time). Unless you use LTO, in which case the compiler can see across the files again and may break your benchmark. (Or maybe there's another reason with dhrystone, I don't know.)
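A minimal sketch of that benchmark layout, with hypothetical names (`checksum`, `run_benchmark`) and a hypothetical two-file split. It is not how dhrystone itself is organized, just an illustration of the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* --- work.c (hypothetical): the code being measured --- */
uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}

/* --- bench.c (hypothetical): the repeat loop --- */
uint32_t run_benchmark(const uint8_t *buf, size_t len, unsigned long repeats)
{
    uint32_t last = 0;
    for (unsigned long r = 0; r < repeats; r++) {
        /* Same inputs every iteration, so the result is loop-invariant.
         * If checksum() lives in another file and LTO is off, the compiler
         * cannot prove that and has to make the call every time. */
        last = checksum(buf, len);
    }
    return last;
}
```

Once `checksum()` is visible to the optimizer (same file, or `-flto`), it may be inlined and the whole computation hoisted out of the repeat loop, so the timed region can end up doing the work only once. Whether that happens depends on the compiler and flags; it is a possibility to check for, not a guarantee.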