[英]OpenJDK Panama Vector API jdk.incubator.vector not giving improved results for Vector dot product
I am testing OpenJDK Panama Vector API jdk.incubator.vector and I performed tests on amazon c5.4xlarge instance.我正在测试OpenJDK Panama Vector API jdk.incubator.vector,并在 amazon c5.4xlarge 实例上进行了测试。 But in every case simple unrolled vector dot product is out performing the Vector API code.但在每种情况下,简单展开矢量点积都无法执行 Vector API 代码。
My Question is: Why am I not able to get the performance boost as shown in Richard Startin's blog .我的问题是:为什么我无法获得Richard Startin 的博客中所示的性能提升。 The same performance improvement is also discussed in this conference meetup by Intel people.英特尔人员在这次会议聚会中也讨论了同样的性能改进。 What is missing?什么不见了?
JMH benchmark test results: JMH基准测试结果:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.unrolled 1048576 thrpt 25 2440.726 ? 21.372 ops/s
FloatVector256DotProduct.vanilla 1048576 thrpt 25 807.721 ? 0.084 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 25 909.977 ? 6.542 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 25 887.422 ? 5.557 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 25 916.955 ? 4.652 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 25 877.569 ? 1.451 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 25 2096.782 ? 6.778 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 25 1627.320 ? 6.824 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 25 2102.654 ? 11.637 ops/s
AWS instance type: c5.4xlarge
AWS 实例类型: c5.4xlarge
CPU details: CPU详细信息:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 4
CPU MHz: 3404.362
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Code snippets.代码片段。 Please see complete code in this github repo请参阅此 github 存储库中的完整代码
JavaDocExample: This is shared in the java doc of vectorIntrinsic branch of OpenJDK. JavaDocExample:这在 OpenJDK 的 vectorIntrinsic 分支的 java 文档中共享。
@Benchmark
public void simpleMultiplyUnrolled() {
for (int i = 0; i < size; i += 8) {
c[i] = a[i] * b[i];
c[i + 1] = a[i + 1] * b[i + 1];
c[i + 2] = a[i + 2] * b[i + 2];
c[i + 3] = a[i + 3] * b[i + 3];
c[i + 4] = a[i + 4] * b[i + 4];
c[i + 5] = a[i + 5] * b[i + 5];
c[i + 6] = a[i + 6] * b[i + 6];
c[i + 7] = a[i + 7] * b[i + 7];
}
}
@Benchmark
public void simpleMultiply() {
for (int i = 0; i < size; i++) {
c[i] = a[i] * b[i];
}
}
@Benchmark
public void vectorMultiply() {
int i = 0;
// It is assumed array arguments are of the same size
for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
FloatVector vc = va.mul(vb);
vc.intoArray(c, i);
}
for (; i < a.length; i++) {
c[i] = a[i] * b[i];
}
}
FloatVector256DotProduct: this code is shamelessly copied from Richard Startin's blog . FloatVector256DotProduct:此代码无耻地从Richard Startin 的博客中复制。 Thanks Richard for insightful blogs.感谢理查德富有洞察力的博客。
@Benchmark
public float vectorfma() {
var sum = FloatVector.zero(F256);
for (int i = 0; i < size; i += F256.length()) {
var l = FloatVector.fromArray(F256, left, i);
var r = FloatVector.fromArray(F256, right, i);
sum = l.fma(r, sum);
}
return sum.reduceLanes(ADD);
}
@Benchmark
public float vectorfmaUnrolled() {
var sum1 = FloatVector.zero(F256);
var sum2 = FloatVector.zero(F256);
var sum3 = FloatVector.zero(F256);
var sum4 = FloatVector.zero(F256);
int width = F256.length();
for (int i = 0; i < size; i += width * 4) {
sum1 = FloatVector.fromArray(F256, left, i).fma(FloatVector.fromArray(F256, right, i), sum1);
sum2 = FloatVector.fromArray(F256, left, i + width).fma(FloatVector.fromArray(F256, right, i + width), sum2);
sum3 = FloatVector.fromArray(F256, left, i + width * 2).fma(FloatVector.fromArray(F256, right, i + width * 2), sum3);
sum4 = FloatVector.fromArray(F256, left, i + width * 3).fma(FloatVector.fromArray(F256, right, i + width * 3), sum4);
}
return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}
@Benchmark
public float vector() {
var sum = FloatVector.zero(F256);
for (int i = 0; i < size; i += F256.length()) {
var l = FloatVector.fromArray(F256, left, i);
var r = FloatVector.fromArray(F256, right, i);
sum = l.mul(r).add(sum);
}
return sum.reduceLanes(ADD);
}
@Benchmark
public float vectorUnrolled() {
var sum1 = FloatVector.zero(F256);
var sum2 = FloatVector.zero(F256);
var sum3 = FloatVector.zero(F256);
var sum4 = FloatVector.zero(F256);
int width = F256.length();
for (int i = 0; i < size; i += width * 4) {
sum1 = FloatVector.fromArray(F256, left, i).mul(FloatVector.fromArray(F256, right, i)).add(sum1);
sum2 = FloatVector.fromArray(F256, left, i + width).mul(FloatVector.fromArray(F256, right, i + width)).add(sum2);
sum3 = FloatVector.fromArray(F256, left, i + width * 2).mul(FloatVector.fromArray(F256, right, i + width * 2)).add(sum3);
sum4 = FloatVector.fromArray(F256, left, i + width * 3).mul(FloatVector.fromArray(F256, right, i + width * 3)).add(sum4);
}
return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}
@Benchmark
public float unrolled() {
float s0 = 0f;
float s1 = 0f;
float s2 = 0f;
float s3 = 0f;
float s4 = 0f;
float s5 = 0f;
float s6 = 0f;
float s7 = 0f;
for (int i = 0; i < size; i += 8) {
s0 = Math.fma(left[i + 0], right[i + 0], s0);
s1 = Math.fma(left[i + 1], right[i + 1], s1);
s2 = Math.fma(left[i + 2], right[i + 2], s2);
s3 = Math.fma(left[i + 3], right[i + 3], s3);
s4 = Math.fma(left[i + 4], right[i + 4], s4);
s5 = Math.fma(left[i + 5], right[i + 5], s5);
s6 = Math.fma(left[i + 6], right[i + 6], s6);
s7 = Math.fma(left[i + 7], right[i + 7], s7);
}
return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
@Benchmark
public float vanilla() {
float sum = 0f;
for (int i = 0; i < size; ++i) {
sum = Math.fma(left[i], right[i], sum);
}
return sum;
}
Process followed to compile and use the OpenJDK Panama dev vectorIntrinsic branch as showed in this SO question如this SO question所示,编译和使用OpenJDK Panama dev vectorIntrinsic分支的过程
hg clone http://hg.openjdk.java.net/panama/dev/
cd dev/
hg checkout vectorIntrinsics
hg branch vectorIntrinsics
bash configure
make images
Things I checked why it should have worked.我检查了为什么它应该起作用的事情。
I came across this post which was answered in the by @iwanowww here: https://gist.github.com/iwanowww/221df8893fbaa4b6b0904e3036221b1d .我在这里遇到了@iwanowww 回答的这篇文章: https://gist.github.com/iwanowww/221df8893fbaa4b6b0904e3036221b1d 。 In short, this was a regression issue that was fixed since then, details below.简而言之,这是一个从那时起就修复的回归问题,详情如下。
TL;DR it is fixed now TL;DR 现在已修复
(1) The regression in FloatVector256DotProduct.vector* with latest vectorIntrinsics branch is caused by a bug in vector operations intrinsification: (1) 带有最新 vectorIntrinsics 分支的 FloatVector256DotProduct.vector* 中的回归是由向量运算内在化中的错误引起的:
2675 92 b net.codingdemon.vectorization.FloatVector256DotProduct::vector (75 bytes)
...
@ 3 jdk.incubator.vector.FloatVector::zero (35 bytes) force inline by annotation
@ 6 jdk.incubator.vector.FloatVector$FloatSpecies::vectorType (5 bytes) accessor
@ 13 jdk.incubator.vector.AbstractSpecies::length (5 bytes) accessor
@ 19 jdk.incubator.vector.FloatVector::toBits (6 bytes) force inline by annotation
@ 1 java.lang.Float::floatToIntBits (15 bytes) (intrinsic)
@ 23 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
@ 4 java.lang.invoke.LambdaForm$MH/0x0000000800b8c040::invoke (8 bytes) force inline by annotation
@ 28 jdk.internal.vm.vector.VectorSupport::broadcastCoerced (35 bytes) failed to inline (intrinsic)
The following patch fixes the bug:以下补丁修复了该错误:
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -476,7 +476,7 @@
// TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
if (!arch_supports_vector(VectorNode::replicate_opcode(elem_bt), num_elem, elem_bt,
- is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/) {
+ (is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/)) {
if (C->print_intrinsics()) {
tty->print_cr(" ** not supported: arity=0 op=broadcast vlen=%d etype=%s ismask=%d",
num_elem, type2name(elem_bt),
BEFORE:前:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vanilla 1048576 thrpt 5 679.280 ± 13.731 ops/s
FloatVector256DotProduct.unrolled 1048576 thrpt 5 2319.770 ± 123.943 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 5 803.740 ± 42.596 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 5 797.153 ± 49.129 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 828.172 ± 16.936 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 798.037 ± 85.566 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 5 1888.662 ± 55.922 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 5 1486.322 ± 93.864 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 5 1525.046 ± 110.700 ops/s
AFTER:后:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vanilla 1048576 thrpt 5 666.581 ± 8.727 ops/s
FloatVector256DotProduct.unrolled 1048576 thrpt 5 2416.695 ± 106.223 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 5 3776.422 ± 117.357 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 5 3734.246 ± 122.463 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3804.485 ± 44.797 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 1158.018 ± 15.955 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 5 1914.794 ± 51.329 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 5 1405.345 ± 52.025 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 5 1832.133 ± 56.256 ops/s
(2) The regression in vectorfmaUnrolled (compared to vectorfma) is caused by well-known inlining issues which break vector box elimination: (2) vectorfmaUnrolled 中的回归(与 vectorfma 相比)是由众所周知的破坏矢量框消除的内联问题引起的:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3804.485 ± 44.797 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 1158.018 ± 15.955 ops/s
19727 95 b net.codingdemon.vectorization.FloatVector256DotProduct::vectorfmaUnrolled (228 bytes)
...
@ 209 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
@ 213 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
@ 218 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
...
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3938.922 ± 97.041 ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm 1048576 thrpt 5 0.111 ± 0.003 B/op
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 2052.549 ± 68.859 ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm 1048576 thrpt 5 1573537.127 ± 22.886 B/op
Until the inlining is fixed, as a workaround, a warm-up phase with smaller data input can help:在修复内联之前,作为一种解决方法,具有较小数据输入的预热阶段可以帮助:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 128 thrpt 5 54838734.769 ± 161477.746 ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm 128 thrpt 5 ≈ 10⁻⁵ B/op
FloatVector256DotProduct.vectorfmaUnrolled 128 thrpt 5 68993637.658 ± 359974.720 ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm 128 thrpt 5 ≈ 10⁻⁵ B/op
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.