OpenJDK Panama Vector API jdk.incubator.vector not giving improved results for Vector dot product
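I am testing the OpenJDK Panama Vector API (jdk.incubator.vector) and ran the tests on an Amazon c5.4xlarge instance. In every case, however, the simple unrolled dot product outperforms the Vector API code.
My question is: why am I not getting the performance boost shown in Richard Startin's blog? The same performance improvement is also discussed by Intel engineers in this conference meetup. What is missing?
JMH benchmark results: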
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.unrolled 1048576 thrpt 25 2440.726 ± 21.372 ops/s
FloatVector256DotProduct.vanilla 1048576 thrpt 25 807.721 ± 0.084 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 25 909.977 ± 6.542 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 25 887.422 ± 5.557 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 25 916.955 ± 4.652 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 25 877.569 ± 1.451 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 25 2096.782 ± 6.778 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 25 1627.320 ± 6.824 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 25 2102.654 ± 11.637 ops/s
AWS instance type: c5.4xlarge
CPU details:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 4
CPU MHz: 3404.362
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Code snippets. See the full code in this GitHub repository.
JavaDocExample: this example is shared in the Javadoc of the vectorIntrinsics branch of OpenJDK.
@Benchmark
public void simpleMultiplyUnrolled() {
    for (int i = 0; i < size; i += 8) {
        c[i] = a[i] * b[i];
        c[i + 1] = a[i + 1] * b[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3];
        c[i + 4] = a[i + 4] * b[i + 4];
        c[i + 5] = a[i + 5] * b[i + 5];
        c[i + 6] = a[i + 6] * b[i + 6];
        c[i + 7] = a[i + 7] * b[i + 7];
    }
}

@Benchmark
public void simpleMultiply() {
    for (int i = 0; i < size; i++) {
        c[i] = a[i] * b[i];
    }
}

@Benchmark
public void vectorMultiply() {
    int i = 0;
    // It is assumed array arguments are of the same size
    for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        FloatVector vc = va.mul(vb);
        vc.intoArray(c, i);
    }
    // Scalar tail for the elements that do not fill a whole vector
    for (; i < a.length; i++) {
        c[i] = a[i] * b[i];
    }
}
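The vectorMultiply loop above splits the work into a main loop that runs up to SPECIES.loopBound(a.length) and a scalar tail for the leftovers. As a rough illustration that runs without the incubator module, the sketch below mimics that split in plain Java. The class name, the LANES constant (8, as for 256-bit float vectors), and the loopBound helper are assumptions made for illustration; they are not part of the Vector API.

```java
// Plain-Java sketch of the main-loop/tail split; this is not the Vector API itself.
class LoopBoundSketch {
    // Hypothetical lane count: 8 floats fit in a 256-bit vector.
    static final int LANES = 8;

    // Largest multiple of LANES that is <= length (the idea behind loopBound).
    static int loopBound(int length) {
        return length - (length % LANES);
    }

    static float[] multiply(float[] a, float[] b) {
        float[] c = new float[a.length];
        int i = 0;
        int bound = loopBound(a.length);
        // Main loop: each pass handles one full block of LANES elements.
        for (; i < bound; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                c[i + lane] = a[i + lane] * b[i + lane];
            }
        }
        // Tail loop: the remaining elements that do not fill a block.
        for (; i < a.length; i++) {
            c[i] = a[i] * b[i];
        }
        return c;
    }
}
```

With an 11-element input, the main loop covers indices 0 to 7 and the tail covers 8 to 10.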
FloatVector256DotProduct: this code is shamelessly copied from Richard Startin's blog. Thanks to Richard for the insightful blog posts.
@Benchmark
public float vectorfma() {
    var sum = FloatVector.zero(F256);
    for (int i = 0; i < size; i += F256.length()) {
        var l = FloatVector.fromArray(F256, left, i);
        var r = FloatVector.fromArray(F256, right, i);
        sum = l.fma(r, sum);
    }
    return sum.reduceLanes(ADD);
}

@Benchmark
public float vectorfmaUnrolled() {
    var sum1 = FloatVector.zero(F256);
    var sum2 = FloatVector.zero(F256);
    var sum3 = FloatVector.zero(F256);
    var sum4 = FloatVector.zero(F256);
    int width = F256.length();
    for (int i = 0; i < size; i += width * 4) {
        sum1 = FloatVector.fromArray(F256, left, i).fma(FloatVector.fromArray(F256, right, i), sum1);
        sum2 = FloatVector.fromArray(F256, left, i + width).fma(FloatVector.fromArray(F256, right, i + width), sum2);
        sum3 = FloatVector.fromArray(F256, left, i + width * 2).fma(FloatVector.fromArray(F256, right, i + width * 2), sum3);
        sum4 = FloatVector.fromArray(F256, left, i + width * 3).fma(FloatVector.fromArray(F256, right, i + width * 3), sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}

@Benchmark
public float vector() {
    var sum = FloatVector.zero(F256);
    for (int i = 0; i < size; i += F256.length()) {
        var l = FloatVector.fromArray(F256, left, i);
        var r = FloatVector.fromArray(F256, right, i);
        sum = l.mul(r).add(sum);
    }
    return sum.reduceLanes(ADD);
}

@Benchmark
public float vectorUnrolled() {
    var sum1 = FloatVector.zero(F256);
    var sum2 = FloatVector.zero(F256);
    var sum3 = FloatVector.zero(F256);
    var sum4 = FloatVector.zero(F256);
    int width = F256.length();
    for (int i = 0; i < size; i += width * 4) {
        sum1 = FloatVector.fromArray(F256, left, i).mul(FloatVector.fromArray(F256, right, i)).add(sum1);
        sum2 = FloatVector.fromArray(F256, left, i + width).mul(FloatVector.fromArray(F256, right, i + width)).add(sum2);
        sum3 = FloatVector.fromArray(F256, left, i + width * 2).mul(FloatVector.fromArray(F256, right, i + width * 2)).add(sum3);
        sum4 = FloatVector.fromArray(F256, left, i + width * 3).mul(FloatVector.fromArray(F256, right, i + width * 3)).add(sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}

@Benchmark
public float unrolled() {
    float s0 = 0f;
    float s1 = 0f;
    float s2 = 0f;
    float s3 = 0f;
    float s4 = 0f;
    float s5 = 0f;
    float s6 = 0f;
    float s7 = 0f;
    for (int i = 0; i < size; i += 8) {
        s0 = Math.fma(left[i + 0], right[i + 0], s0);
        s1 = Math.fma(left[i + 1], right[i + 1], s1);
        s2 = Math.fma(left[i + 2], right[i + 2], s2);
        s3 = Math.fma(left[i + 3], right[i + 3], s3);
        s4 = Math.fma(left[i + 4], right[i + 4], s4);
        s5 = Math.fma(left[i + 5], right[i + 5], s5);
        s6 = Math.fma(left[i + 6], right[i + 6], s6);
        s7 = Math.fma(left[i + 7], right[i + 7], s7);
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}

@Benchmark
public float vanilla() {
    float sum = 0f;
    for (int i = 0; i < size; ++i) {
        sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
}
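The dot-product variants above accumulate their partial sums in different orders, and float addition is not associative, so their results can differ in the last bits. Before comparing throughput it can help to confirm that the variants agree within a tolerance. The following is a minimal plain-Java sketch of such a check; the class name, the four-accumulator variant, and the closeEnough helper are hypothetical and are not taken from the benchmark repository. Unlike the benchmark loops, the unrolled variant here also handles lengths that are not a multiple of the unroll factor.

```java
// Hypothetical sanity check: compare two accumulation orders of a float dot product.
class DotProductCheck {
    // Single accumulator, same shape as the vanilla benchmark.
    static float vanilla(float[] left, float[] right) {
        float sum = 0f;
        for (int i = 0; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }

    // Four independent accumulators plus a scalar tail.
    static float unrolled4(float[] left, float[] right) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        int bound = left.length - (left.length % 4);
        for (; i < bound; i += 4) {
            s0 = Math.fma(left[i], right[i], s0);
            s1 = Math.fma(left[i + 1], right[i + 1], s1);
            s2 = Math.fma(left[i + 2], right[i + 2], s2);
            s3 = Math.fma(left[i + 3], right[i + 3], s3);
        }
        float sum = (s0 + s1) + (s2 + s3);
        for (; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }

    // Relative comparison; exact equality is too strict across summation orders.
    static boolean closeEnough(float x, float y, float relTol) {
        return Math.abs(x - y) <= relTol * Math.max(Math.abs(x), Math.abs(y));
    }
}
```

A suitable tolerance depends on the data and the array size; the value is left to the caller for that reason.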
The procedure for building and using the vectorIntrinsics branch of OpenJDK Panama dev, as described in this SO question:
hg clone http://hg.openjdk.java.net/panama/dev/
cd dev/
hg checkout vectorIntrinsics
hg branch vectorIntrinsics
bash configure
make images
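Once the image is built, a quick way to confirm that the JVM actually sees the incubator module is to query the module system. The sketch below uses only standard java.lang.module and ModuleLayer calls; the class name is hypothetical. Incubator modules are not resolved by default, so the second check is expected to report false unless the JVM is started with --add-modules jdk.incubator.vector.

```java
import java.lang.module.ModuleFinder;

// Hypothetical helper: report whether jdk.incubator.vector is available.
class IncubatorModuleCheck {
    static final String MODULE = "jdk.incubator.vector";

    // True if the JDK image running this code ships the module at all.
    static boolean shippedWithJdk() {
        return ModuleFinder.ofSystem().find(MODULE).isPresent();
    }

    // True if the module was resolved into the boot layer of the running JVM.
    static boolean resolvedAtRuntime() {
        return ModuleLayer.boot().findModule(MODULE).isPresent();
    }

    public static void main(String[] args) {
        System.out.println("shipped with JDK: " + shippedWithJdk());
        System.out.println("resolved at runtime: " + resolvedAtRuntime());
    }
}
```

If the module ships with the JDK but is not resolved, the benchmark classes will fail to load until the --add-modules option is passed to both javac and java.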
I checked the things that should make this work.
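I came across this post with an answer from @iwanowww: https://gist.github.com/iwanowww/221df8893fbaa4b6b0904e3036221b1d . In short, this was a regression that has since been fixed; the details follow.
TL;DR: it is fixed now.
(1) The regression in FloatVector256DotProduct.vector* with the latest vectorIntrinsics branch is caused by a bug in the intrinsification of vector operations: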
2675 92 b net.codingdemon.vectorization.FloatVector256DotProduct::vector (75 bytes)
...
@ 3 jdk.incubator.vector.FloatVector::zero (35 bytes) force inline by annotation
@ 6 jdk.incubator.vector.FloatVector$FloatSpecies::vectorType (5 bytes) accessor
@ 13 jdk.incubator.vector.AbstractSpecies::length (5 bytes) accessor
@ 19 jdk.incubator.vector.FloatVector::toBits (6 bytes) force inline by annotation
@ 1 java.lang.Float::floatToIntBits (15 bytes) (intrinsic)
@ 23 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
@ 4 java.lang.invoke.LambdaForm$MH/0x0000000800b8c040::invoke (8 bytes) force inline by annotation
@ 28 jdk.internal.vm.vector.VectorSupport::broadcastCoerced (35 bytes) failed to inline (intrinsic)
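The key line in the log above is the one ending in "failed to inline (intrinsic)". Output of this kind is presumably produced with HotSpot's -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining options, although the post does not say which flags were used. For longer logs, a small scan can list the methods whose intrinsics failed. The following is a minimal sketch; the class name is hypothetical, and the line layout ("@ bci method (n bytes) message") is assumed from the excerpt rather than from any format specification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: find methods whose intrinsic failed to inline.
class InliningLogScan {
    static final String MARKER = "failed to inline (intrinsic)";

    static List<String> failedIntrinsics(List<String> logLines) {
        List<String> failed = new ArrayList<>();
        for (String line : logLines) {
            if (!line.contains(MARKER)) {
                continue;
            }
            // Assumed layout: "@ <bci> <method> (<n> bytes) <message>"
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length >= 3 && tokens[0].equals("@")) {
                failed.add(tokens[2]);
            }
        }
        return failed;
    }
}
```

Applied to the excerpt, the scan would report only VectorSupport::broadcastCoerced; the Float::floatToIntBits line is marked "(intrinsic)" without a failure and is skipped.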
The following patch fixes the bug:
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -476,7 +476,7 @@
// TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
if (!arch_supports_vector(VectorNode::replicate_opcode(elem_bt), num_elem, elem_bt,
- is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/) {
+ (is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/)) {
if (C->print_intrinsics()) {
tty->print_cr(" ** not supported: arity=0 op=broadcast vlen=%d etype=%s ismask=%d",
num_elem, type2name(elem_bt),
Before:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vanilla 1048576 thrpt 5 679.280 ± 13.731 ops/s
FloatVector256DotProduct.unrolled 1048576 thrpt 5 2319.770 ± 123.943 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 5 803.740 ± 42.596 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 5 797.153 ± 49.129 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 828.172 ± 16.936 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 798.037 ± 85.566 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 5 1888.662 ± 55.922 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 5 1486.322 ± 93.864 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 5 1525.046 ± 110.700 ops/s
After:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vanilla 1048576 thrpt 5 666.581 ± 8.727 ops/s
FloatVector256DotProduct.unrolled 1048576 thrpt 5 2416.695 ± 106.223 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 5 3776.422 ± 117.357 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 5 3734.246 ± 122.463 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3804.485 ± 44.797 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 1158.018 ± 15.955 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 5 1914.794 ± 51.329 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 5 1405.345 ± 52.025 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 5 1832.133 ± 56.256 ops/s
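To make the size of the change easier to read, the before and after scores can be turned into speedup ratios. The sketch below does that for the four Vector API dot-product benchmarks, using the scores copied from the two tables; the class and method names are hypothetical, and the error margins are ignored, so the ratios are only approximate.

```java
// Hypothetical helper: speedup ratios from the before/after throughput scores.
class SpeedupTable {
    static double speedup(double beforeOpsPerSec, double afterOpsPerSec) {
        return afterOpsPerSec / beforeOpsPerSec;
    }

    public static void main(String[] args) {
        String[] names = {"vector", "vectorUnrolled", "vectorfma", "vectorfmaUnrolled"};
        // Scores in ops/s, taken from the tables before and after the patch.
        double[] before = {803.740, 797.153, 828.172, 798.037};
        double[] after = {3776.422, 3734.246, 3804.485, 1158.018};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-18s %.2fx%n", names[i], speedup(before[i], after[i]));
        }
    }
}
```

The first three benchmarks improve by well over 4x, while vectorfmaUnrolled gains far less.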
(2) The regression in vectorfmaUnrolled (compared with vectorfma) is caused by well-known inlining issues that break vector box elimination:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3804.485 ± 44.797 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 1158.018 ± 15.955 ops/s
19727 95 b net.codingdemon.vectorization.FloatVector256DotProduct::vectorfmaUnrolled (228 bytes)
...
@ 209 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
@ 213 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
@ 218 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
...
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3938.922 ± 97.041 ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm 1048576 thrpt 5 0.111 ± 0.003 B/op
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 2052.549 ± 68.859 ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm 1048576 thrpt 5 1573537.127 ± 22.886 B/op
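The gc.alloc.rate.norm figures make the failed box elimination visible: vectorfma allocates almost nothing per operation, while vectorfmaUnrolled allocates about 1.5 MB. Dividing that figure by the number of loop iterations gives a rough allocation per iteration. The sketch below does the arithmetic, assuming F256 is the 256-bit float species with 8 lanes (its definition is not shown in the snippets) and a loop step of 4 vectors, as in vectorfmaUnrolled. The class and method names are hypothetical.

```java
// Hypothetical helper: bytes allocated per loop iteration from gc.alloc.rate.norm.
class AllocPerIteration {
    static double bytesPerIteration(double bytesPerOp, int size, int lanes, int unroll) {
        // Each iteration consumes lanes * unroll array elements.
        int iterations = size / (lanes * unroll);
        return bytesPerOp / iterations;
    }

    public static void main(String[] args) {
        // 1573537.127 B/op and size 1048576 come from the table above;
        // 8 lanes and unroll factor 4 are assumptions based on the benchmark code.
        double perIteration = bytesPerIteration(1573537.127, 1_048_576, 8, 4);
        System.out.printf("about %.1f bytes per iteration%n", perIteration);
    }
}
```

The result is about 48 bytes per iteration, which would be consistent with a small heap object being allocated on each pass through the loop, although that interpretation is a guess and is not stated in the answer.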
Until the inlining is fixed, a warm-up phase with a smaller input can help as a workaround:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 128 thrpt 5 54838734.769 ± 161477.746 ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm 128 thrpt 5 ≈ 10⁻⁵ B/op
FloatVector256DotProduct.vectorfmaUnrolled 128 thrpt 5 68993637.658 ± 359974.720 ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm 128 thrpt 5 ≈ 10⁻⁵ B/op
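The shape of that workaround can be sketched outside JMH: call the kernel many times on a small input first, so the JIT compiles it against that workload, and only then run it on the large input. The code below is a minimal plain-Java sketch with a scalar dot product standing in for the Vector API kernel. The class name, helper methods, and parameters are hypothetical, and whether warm-up changes the inlining decisions depends on the JVM build.

```java
import java.util.Arrays;

// Hypothetical sketch of "warm up on small input, then run on large input".
class WarmupSketch {
    // Scalar stand-in for the vectorized kernel.
    static float dot(float[] left, float[] right) {
        float sum = 0f;
        for (int i = 0; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }

    static float[] filled(int n, float value) {
        float[] array = new float[n];
        Arrays.fill(array, value);
        return array;
    }

    static float warmThenRun(int warmupSize, int warmupRounds, int size) {
        float[] smallLeft = filled(warmupSize, 1f);
        float[] smallRight = filled(warmupSize, 2f);
        float sink = 0f;
        // Warm-up phase: many calls on the small input.
        for (int round = 0; round < warmupRounds; round++) {
            sink += dot(smallLeft, smallRight);
        }
        // Measured phase: a single call on the large input.
        float result = dot(filled(size, 1f), filled(size, 2f));
        // Consume the sink so the warm-up calls are not trivially dead code.
        if (Float.isNaN(sink)) {
            throw new IllegalStateException("unexpected NaN during warm-up");
        }
        return result;
    }
}
```

In a JMH setup the same idea would have to be expressed through the benchmark's warm-up configuration rather than by hand.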