Java Math.abs(int) optimizations: why is this code 6x slower?
As you may know, Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE, so to prevent a negative result, the following safeAbs method was implemented in my project:
public static int safeAbs(int i) {
i = Math.abs(i);
return i < 0 ? 0 : i;
}
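The overflow case mentioned above can be checked directly. The following is a minimal sketch; the class name AbsOverflowDemo and the method name safeAbsClamp are hypothetical, chosen only for this illustration, and the method body mirrors the safeAbs shown above.

```java
class AbsOverflowDemo {
    // Same logic as the safeAbs above: take the absolute value,
    // then clamp the one result that is still negative to 0.
    static int safeAbsClamp(int i) {
        i = Math.abs(i);
        return i < 0 ? 0 : i;
    }

    public static void main(String[] args) {
        // A two's-complement int has no positive counterpart for MIN_VALUE,
        // so negating it wraps around to MIN_VALUE again.
        System.out.println(Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE); // prints true
        System.out.println(safeAbsClamp(Integer.MIN_VALUE));                  // prints 0
        System.out.println(safeAbsClamp(-42));                                // prints 42
    }
}
```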
I compared its performance with the following version:
public static int safeAbs(int i) {
return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
}
The first one is almost 6x slower than the second (the second one performs almost the same as "pure" Math.abs(int)). From my point of view, there is no significant difference in the bytecode, but I guess the difference shows up in the JIT-compiled "assembly" code:
"slow" version:
0x00007f0149119720: mov %eax,0xfffffffffffec000(%rsp)
0x00007f0149119727: push %rbp
0x00007f0149119728: sub $0x20,%rsp
0x00007f014911972c: test %esi,%esi
0x00007f014911972e: jl 0x7f0149119734
0x00007f0149119730: mov %esi,%eax
0x00007f0149119732: jmp 0x7f014911973c
0x00007f0149119734: neg %esi
0x00007f0149119736: test %esi,%esi
0x00007f0149119738: jl 0x7f0149119748
0x00007f014911973a: mov %esi,%eax
0x00007f014911973c: add $0x20,%rsp
0x00007f0149119740: pop %rbp
0x00007f0149119741: test %eax,0x1772e8b9(%rip) ; {poll_return}
0x00007f0149119747: retq
0x00007f0149119748: mov %esi,(%rsp)
0x00007f014911974b: mov $0xffffff65,%esi
0x00007f0149119750: nop
0x00007f0149119753: callq 0x7f01490051a0 ; OopMap{off=56}
;*ifge
; - math.FastAbs::safeAbsSlow@6 (line 16)
; {runtime_call}
0x00007f0149119758: callq 0x7f015f521d20 ; {runtime_call}
"normal" version:
# {method} {0x00007f31acf28cd8} 'safeAbsFast' '(I)I' in 'math/FastAbs'
# parm0: rsi = int
# [sp+0x30] (sp of caller)
0x00007f31b08c7360: mov %eax,0xfffffffffffec000(%rsp)
0x00007f31b08c7367: push %rbp
0x00007f31b08c7368: sub $0x20,%rsp
0x00007f31b08c736c: cmp $0x80000000,%esi
0x00007f31b08c7372: je 0x7f31b08c738e
0x00007f31b08c7374: mov %esi,%r10d
0x00007f31b08c7377: neg %r10d
0x00007f31b08c737a: test %esi,%esi
0x00007f31b08c737c: mov %esi,%eax
0x00007f31b08c737e: cmovl %r10d,%eax
0x00007f31b08c7382: add $0x20,%rsp
0x00007f31b08c7386: pop %rbp
0x00007f31b08c7387: test %eax,0x162c2c73(%rip) ; {poll_return}
0x00007f31b08c738d: retq
0x00007f31b08c738e: mov %esi,(%rsp)
0x00007f31b08c7391: mov $0xffffff65,%esi
0x00007f31b08c7396: nop
0x00007f31b08c7397: callq 0x7f31b07b11a0 ; OopMap{off=60}
;*if_icmpne
; - math.FastAbs::safeAbsFast@3 (line 17)
; {runtime_call}
0x00007f31b08c739c: callq 0x7f31c5863d20 ; {runtime_call}
Benchmark code:
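import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
@BenchmarkMode(Mode.AverageTime)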
@Fork(value = 1, jvmArgsAppend = {"-Xms3g", "-Xmx3g", "-server"})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
public class SafeAbsMicroBench {
@State(Scope.Benchmark)
public static class Data {
final int len = 10_000_000;
final int[] values = new int[len];
@Setup(Level.Trial)
public void setup() {
// preparing 10 million random integers without MIN_VALUE
for (int i = 0; i < len; i++) {
int val;
do {
val = ThreadLocalRandom.current().nextInt();
} while (val == Integer.MIN_VALUE);
values[i] = val;
}
}
}
@Benchmark
public int safeAbsSlow(Data data) {
int sum = 0;
for (int i = 0; i < data.len; i++)
sum += safeAbsSlow(data.values[i]);
return sum;
}
@Benchmark
public int safeAbsFast(Data data) {
int sum = 0;
for (int i = 0; i < data.len; i++)
sum += safeAbsFast(data.values[i]);
return sum;
}
private int safeAbsSlow(int i) {
i = Math.abs(i);
return i < 0 ? 0 : i;
}
private int safeAbsFast(int i) {
return i == Integer.MIN_VALUE ? 0 : Math.abs(i);
}
public static void main(String[] args) throws RunnerException {
final Options options = new OptionsBuilder()
.include(SafeAbsMicroBench.class.getSimpleName())
.build();
new Runner(options).run();
}
}
Results (Linux x86-64, 7820HQ; checked on Oracle JDK 8 and 11 with very similar results):
Benchmark Mode Cnt Score Error Units
SafeAbsMicroBench.safeAbsFast avgt 10 6435155.516 ± 47130.767 ns/op
SafeAbsMicroBench.safeAbsSlow avgt 10 35646411.744 ± 776173.621 ns/op
Can someone explain why the first version is significantly slower than the second one?
There is a difference in the native code generated for the safeAbsSlow and safeAbsFast methods.
safeAbsSlow (C2, level 4):
0x0000023d12ec4b14: add eax,ecx
0x0000023d12ec4b16: inc ebx
0x0000023d12ec4b18: cmp ebx,989680h
0x0000023d12ec4b1e: jnl 23d12ec4b4eh ; jump if `ebx` was not less than `10_000_000`
0x0000023d12ec4b20: mov ecx,dword ptr [r9+rbx*4+10h]
0x0000023d12ec4b25: test ecx,ecx
0x0000023d12ec4b27: jnl 23d12ec4b14h ; jump if `ecx` was not less-than `0`
0x0000023d12ec4b29: neg ecx
0x0000023d12ec4b2b: test ecx,ecx
0x0000023d12ec4b2d: jnl 23d12ec4b14h ; jump if `ecx` was not less-than `0`
safeAbsFast (C2, level 4):
0x000001d89e8a4b20: mov ecx,dword ptr [r9+rdi*4+10h]
0x000001d89e8a4b25: cmp ecx,80000000h
0x000001d89e8a4b2b: je 1d89e8a4b66h ; jump if `ecx` was equal to `Integer.MIN_VALUE` (80000000h)
0x000001d89e8a4b2d: mov r11d,ecx
0x000001d89e8a4b30: neg r11d
0x000001d89e8a4b33: test ecx,ecx
0x000001d89e8a4b35: cmovl ecx,r11d
0x000001d89e8a4b39: add eax,ecx
0x000001d89e8a4b3b: inc edi
0x000001d89e8a4b3d: cmp edi,989680h
0x000001d89e8a4b43: jl 1d89e8a4b20h ; jump if `edi` was less than `10_000_000`
As the listings above show, the loop body of safeAbsSlow contains more conditional jumps than that of safeAbsFast.
This is mainly because the Math.abs implementation that is inlined into safeAbsFast is compiled to a conditional move (cmovl) and therefore has no conditional jumps:
0x000001d89e8a4b2d: mov r11d,ecx
0x000001d89e8a4b30: neg r11d
0x000001d89e8a4b33: test ecx,ecx
0x000001d89e8a4b35: cmovl ecx,r11d
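To make the idea of a jump-free absolute value concrete at the source level, here is a minimal sketch using a sign mask. It is an illustration only, not what HotSpot emits and not part of the original benchmark; the class and method names are hypothetical, and no claim is made about how the JIT compiles it.

```java
class BranchFreeAbs {
    // Sign mask: 0 for non-negative i, -1 (all bits set) for negative i.
    // (i ^ mask) - mask leaves i unchanged for mask == 0 and computes
    // ~i + 1, i.e. -i, for mask == -1.
    static int absNoBranch(int i) {
        int mask = i >> 31;
        return (i ^ mask) - mask;
    }

    // Same contract as safeAbs: Integer.MIN_VALUE maps to 0.
    static int safeAbsNoBranch(int i) {
        int a = absNoBranch(i);   // still Integer.MIN_VALUE for MIN_VALUE input
        return a & ~(a >> 31);    // zeroes the result only when it is negative
    }

    public static void main(String[] args) {
        System.out.println(safeAbsNoBranch(-5));                // prints 5
        System.out.println(safeAbsNoBranch(7));                 // prints 7
        System.out.println(safeAbsNoBranch(Integer.MIN_VALUE)); // prints 0
    }
}
```

The slow version, by contrast, keeps two data-dependent conditional jumps per element.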
As a result, the slow version suffers many more branch misses than the normal version when positive and negative values are scattered randomly across the array. Below are the corresponding statistics, collected with the Linux perf profiler:
Benchmark Mode Cnt Score Error Units
safeAbsFast avgt 10 9611659.726 ± 1429082.431 ns/op
safeAbsFast:branch-misses avgt 2869.853 #/op
safeAbsFast:branches avgt 12492918.020 #/op
safeAbsFast:cycles avgt 28212203.936 #/op
safeAbsFast:instructions avgt 92352048.153 #/op
safeAbsSlow avgt 10 44524180.366 ± 6324887.086 ns/op
safeAbsSlow:branch-misses avgt 5006493.144 #/op
safeAbsSlow:branches avgt 17496069.911 #/op
safeAbsSlow:cycles avgt 126413171.674 #/op
safeAbsSlow:instructions avgt 67549877.558 #/op
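The counters above can be turned into rates with simple arithmetic. The sketch below uses only the numbers from the table plus the array length of 10,000,000 from the benchmark; the class and method names are hypothetical. For safeAbsSlow it gives roughly 29% mispredicted branches, or about one miss for every two elements, which is consistent with a sign test on random data being unpredictable; for safeAbsFast the miss rate is around 0.02%.

```java
class BranchMissRate {
    static final double ELEMENTS = 10_000_000; // array length used in the benchmark

    // Fraction of executed branches that were mispredicted.
    static double missRate(double branchMisses, double branches) {
        return branchMisses / branches;
    }

    // Average number of mispredictions per array element.
    static double missesPerElement(double branchMisses) {
        return branchMisses / ELEMENTS;
    }

    public static void main(String[] args) {
        // Values copied from the perf table above (unsorted data set).
        double slowMisses = 5006493.144, slowBranches = 17496069.911;
        double fastMisses = 2869.853,    fastBranches = 12492918.020;

        System.out.printf("safeAbsSlow: miss rate %.4f, misses per element %.4f%n",
                missRate(slowMisses, slowBranches), missesPerElement(slowMisses));
        System.out.printf("safeAbsFast: miss rate %.6f, misses per element %.6f%n",
                missRate(fastMisses, fastBranches), missesPerElement(fastMisses));
    }
}
```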
In contrast, here are the results for a sorted data set:
Benchmark Mode Cnt Score Error Units
safeAbsFast avgt 10 9026800.584 ± 528992.157 ns/op
safeAbsFast:branch-misses avgt 2785.463 #/op
safeAbsFast:branches avgt 12474751.905 #/op
safeAbsFast:cycles avgt 27379727.603 #/op
safeAbsFast:instructions avgt 92418075.715 #/op
safeAbsSlow avgt 10 6981828.374 ± 2375480.834 ns/op
safeAbsSlow:branch-misses avgt 2801.022 #/op
safeAbsSlow:branches avgt 17496585.992 #/op
safeAbsSlow:cycles avgt 19478382.113 #/op
safeAbsSlow:instructions avgt 67589946.278 #/op
With a sorted data set, the previously slow version becomes even faster than the fast one, because the costly branch misses are minimized in this case.
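The answer does not show how the sorted data set was prepared. One plausible way, sketched here as an assumption, is to generate the same random values as in the benchmark's setup() and sort them in ascending order, so that all negative values come before all non-negative ones and the sign branch flips only once. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

class SortedDataSetup {
    // Random ints without Integer.MIN_VALUE, as in the benchmark's setup().
    static int[] randomValues(int len) {
        int[] values = new int[len];
        for (int i = 0; i < len; i++) {
            int val;
            do {
                val = ThreadLocalRandom.current().nextInt();
            } while (val == Integer.MIN_VALUE);
            values[i] = val;
        }
        return values;
    }

    // Assumption: "sorted" means ascending order; the input array is left untouched.
    static int[] sortedCopy(int[] values) {
        int[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] sorted = sortedCopy(randomValues(1_000));
        System.out.println(sorted[0] <= sorted[sorted.length - 1]); // prints true
    }
}
```

In a JMH benchmark, the sorting step would go at the end of the @Setup(Level.Trial) method so that it is not part of the measured time.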
Environment:
openjdk version "12-internal" 2019-03-19
OpenJDK Runtime Environment (slowdebug build 12-internal+0-adhoc.jdk12)
OpenJDK 64-Bit Server VM (slowdebug build 12-internal+0-adhoc.jdk12, mixed mode)