簡體   English   中英

OpenJDK Panama Vector API jdk.incubator.vector 沒有為 Vector 點積提供改進的結果

[英]OpenJDK Panama Vector API jdk.incubator.vector not giving improved results for Vector dot product

我正在測試OpenJDK Panama Vector API jdk.incubator.vector,並在 amazon c5.4xlarge 實例上進行了測試。 但在每種情況下,簡單展開矢量點積都無法執行 Vector API 代碼。

我的問題是:為什么我無法獲得Richard Startin 的博客中所示的性能提升。 英特爾人員在這次會議聚會中也討論了同樣的性能改進。 什么不見了?

JMH基准測試結果:

Benchmark                                              (size)   Mode  Cnt      Score    Error  Units

FloatVector256DotProduct.unrolled                       1048576  thrpt   25   2440.726 ? 21.372  ops/s
FloatVector256DotProduct.vanilla                        1048576  thrpt   25    807.721 ?  0.084  ops/s
FloatVector256DotProduct.vector                         1048576  thrpt   25    909.977 ?  6.542  ops/s
FloatVector256DotProduct.vectorUnrolled                 1048576  thrpt   25    887.422 ?  5.557  ops/s
FloatVector256DotProduct.vectorfma                      1048576  thrpt   25    916.955 ?  4.652  ops/s
FloatVector256DotProduct.vectorfmaUnrolled              1048576  thrpt   25    877.569 ?  1.451  ops/s

JavaDocExample.simpleMultiply                           1048576  thrpt   25  2096.782 ?  6.778  ops/s
JavaDocExample.simpleMultiplyUnrolled                   1048576  thrpt   25  1627.320 ?  6.824  ops/s
JavaDocExample.vectorMultiply                           1048576  thrpt   25  2102.654 ? 11.637  ops/s

AWS 實例類型: c5.4xlarge

CPU詳細信息:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             3404.362
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

代碼片段。 請參閱此 github 存儲庫中的完整代碼

JavaDocExample:這在 OpenJDK 的 vectorIntrinsic 分支的 java 文檔中共享。

    @Benchmark
    public void simpleMultiplyUnrolled() {
        for (int i = 0; i < size; i += 8) {
            c[i] = a[i] * b[i];
            c[i + 1] = a[i + 1] * b[i + 1];
            c[i + 2] = a[i + 2] * b[i + 2];
            c[i + 3] = a[i + 3] * b[i + 3];
            c[i + 4] = a[i + 4] * b[i + 4];
            c[i + 5] = a[i + 5] * b[i + 5];
            c[i + 6] = a[i + 6] * b[i + 6];
            c[i + 7] = a[i + 7] * b[i + 7];
        }
    }

    @Benchmark
    public void simpleMultiply() {
        for (int i = 0; i < size; i++) {
            c[i] = a[i] * b[i];
        }
    }

    @Benchmark
    public void vectorMultiply() {
        int i = 0;
        // It is assumed array arguments are of the same size
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            FloatVector vc = va.mul(vb);
            vc.intoArray(c, i);
        }

        for (; i < a.length; i++) {
            c[i] = a[i] * b[i];
        }
    }

FloatVector256DotProduct:此代碼無恥地從Richard Startin 的博客中復制。 感謝理查德富有洞察力的博客。

  @Benchmark
  public float vectorfma() {
    var sum = FloatVector.zero(F256);
    for (int i = 0; i < size; i += F256.length()) {
      var l = FloatVector.fromArray(F256, left, i);
      var r = FloatVector.fromArray(F256, right, i);
      sum = l.fma(r, sum);
    }
    return sum.reduceLanes(ADD);
  }

  @Benchmark
  public float vectorfmaUnrolled() {
    var sum1 = FloatVector.zero(F256);
    var sum2 = FloatVector.zero(F256);
    var sum3 = FloatVector.zero(F256);
    var sum4 = FloatVector.zero(F256);
    int width = F256.length();
    for (int i = 0; i < size; i += width * 4) {
      sum1 = FloatVector.fromArray(F256, left, i).fma(FloatVector.fromArray(F256, right, i), sum1);
      sum2 = FloatVector.fromArray(F256, left, i + width).fma(FloatVector.fromArray(F256, right, i + width), sum2);
      sum3 = FloatVector.fromArray(F256, left, i + width * 2).fma(FloatVector.fromArray(F256, right, i + width * 2), sum3);
      sum4 = FloatVector.fromArray(F256, left, i + width * 3).fma(FloatVector.fromArray(F256, right, i + width * 3), sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
  }

  @Benchmark
  public float vector() {
    var sum = FloatVector.zero(F256);
    for (int i = 0; i < size; i += F256.length()) {
      var l = FloatVector.fromArray(F256, left, i);
      var r = FloatVector.fromArray(F256, right, i);
      sum = l.mul(r).add(sum);
    }
    return sum.reduceLanes(ADD);
  }

  @Benchmark
  public float vectorUnrolled() {
    var sum1 = FloatVector.zero(F256);
    var sum2 = FloatVector.zero(F256);
    var sum3 = FloatVector.zero(F256);
    var sum4 = FloatVector.zero(F256);
    int width = F256.length();
    for (int i = 0; i < size; i += width * 4) {
      sum1 = FloatVector.fromArray(F256, left, i).mul(FloatVector.fromArray(F256, right, i)).add(sum1);
      sum2 = FloatVector.fromArray(F256, left, i + width).mul(FloatVector.fromArray(F256, right, i + width)).add(sum2);
      sum3 = FloatVector.fromArray(F256, left, i + width * 2).mul(FloatVector.fromArray(F256, right, i + width * 2)).add(sum3);
      sum4 = FloatVector.fromArray(F256, left, i + width * 3).mul(FloatVector.fromArray(F256, right, i + width * 3)).add(sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
  }

  @Benchmark
  public float unrolled() {
    float s0 = 0f;
    float s1 = 0f;
    float s2 = 0f;
    float s3 = 0f;
    float s4 = 0f;
    float s5 = 0f;
    float s6 = 0f;
    float s7 = 0f;
    for (int i = 0; i < size; i += 8) {
      s0 = Math.fma(left[i + 0],  right[i + 0], s0);
      s1 = Math.fma(left[i + 1],  right[i + 1], s1);
      s2 = Math.fma(left[i + 2],  right[i + 2], s2);
      s3 = Math.fma(left[i + 3],  right[i + 3], s3);
      s4 = Math.fma(left[i + 4],  right[i + 4], s4);
      s5 = Math.fma(left[i + 5],  right[i + 5], s5);
      s6 = Math.fma(left[i + 6],  right[i + 6], s6);
      s7 = Math.fma(left[i + 7],  right[i + 7], s7);
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
  }

  @Benchmark
  public float vanilla() {
    float sum = 0f;
    for (int i = 0; i < size; ++i) {
      sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
  }

this SO question所示,編譯和使用OpenJDK Panama dev vectorIntrinsic分支的過程

hg clone http://hg.openjdk.java.net/panama/dev/
cd dev/
hg checkout vectorIntrinsics
hg branch vectorIntrinsics
bash configure
make images

我檢查了為什么它應該起作用的事情。

  1. lscpu 顯示各種 avx 標志。
  2. 我選擇了應該支持 AVX 指令集的 HVM AMI。 https://aws.amazon.com/ec2/instance-types/ 說:† AVX、AVX2 和增強聯網僅適用於使用 HVM AMI 啟動的實例。
  3. 我可以編譯矢量代碼,這意味着我正在使用 OpenJDK 的適當分支。 我使用 --add-modules=jdk.incubator.vector VM 參數運行我的代碼。 我還在 [this intel blog](https://software.intel.com/en-us/articles/vector-api-developer-program-for-java) 中添加了一些其他 VM 參數,例如 state:-XX:TypeProfileLevel= 121
  4. 我檢查了它確實包含 vmulps 指令的編譯匯編代碼。 雖然很難找到它們,因為我在向量 api 代碼中調用方法,並且乘法發生在調用的 mul/fma 方法中的其他一些地方。
  5. 我已經使用 64、128、256、512 等不同的 SIZE 以及使用“FloatVector.SPECIES_PREFERRED”進行了更多測試。 在所有情況下,向量 api 代碼都明顯慢於展開的簡單乘法代碼。

我在這里遇到了@iwanowww 回答的這篇文章: https://gist.github.com/iwanowww/221df8893fbaa4b6b0904e3036221b1d 簡而言之,這是一個從那時起就修復的回歸問題,詳情如下。

TL;DR 現在已修復

(1) 帶有最新 vectorIntrinsics 分支的 FloatVector256DotProduct.vector* 中的回歸是由向量運算內在化中的錯誤引起的:

   2675   92    b        net.codingdemon.vectorization.FloatVector256DotProduct::vector (75 bytes)
   ...
                            @ 3   jdk.incubator.vector.FloatVector::zero (35 bytes)   force inline by annotation
                              @ 6   jdk.incubator.vector.FloatVector$FloatSpecies::vectorType (5 bytes)   accessor
                              @ 13   jdk.incubator.vector.AbstractSpecies::length (5 bytes)   accessor
                              @ 19   jdk.incubator.vector.FloatVector::toBits (6 bytes)   force inline by annotation
                                @ 1   java.lang.Float::floatToIntBits (15 bytes)   (intrinsic)
                              @ 23   java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes)   force inline by annotation
                                @ 4   java.lang.invoke.LambdaForm$MH/0x0000000800b8c040::invoke (8 bytes)   force inline by annotation
                              @ 28   jdk.internal.vm.vector.VectorSupport::broadcastCoerced (35 bytes)   failed to inline (intrinsic)

以下補丁修復了該錯誤:

diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -476,7 +476,7 @@

   // TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
   if (!arch_supports_vector(VectorNode::replicate_opcode(elem_bt), num_elem, elem_bt,
-                            is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/) {
+                            (is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/)) {
     if (C->print_intrinsics()) {
       tty->print_cr("  ** not supported: arity=0 op=broadcast vlen=%d etype=%s ismask=%d",
                     num_elem, type2name(elem_bt),

前:

Benchmark                                    (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vanilla            1048576  thrpt    5   679.280 ±  13.731  ops/s
FloatVector256DotProduct.unrolled           1048576  thrpt    5  2319.770 ± 123.943  ops/s
FloatVector256DotProduct.vector             1048576  thrpt    5   803.740 ±  42.596  ops/s
FloatVector256DotProduct.vectorUnrolled     1048576  thrpt    5   797.153 ±  49.129  ops/s
FloatVector256DotProduct.vectorfma          1048576  thrpt    5   828.172 ±  16.936  ops/s
FloatVector256DotProduct.vectorfmaUnrolled  1048576  thrpt    5   798.037 ±  85.566  ops/s
JavaDocExample.simpleMultiply               1048576  thrpt    5  1888.662 ±  55.922  ops/s
JavaDocExample.simpleMultiplyUnrolled       1048576  thrpt    5  1486.322 ±  93.864  ops/s
JavaDocExample.vectorMultiply               1048576  thrpt    5  1525.046 ± 110.700  ops/s

后:

Benchmark                                    (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vanilla            1048576  thrpt    5   666.581 ±   8.727  ops/s
FloatVector256DotProduct.unrolled           1048576  thrpt    5  2416.695 ± 106.223  ops/s
FloatVector256DotProduct.vector             1048576  thrpt    5  3776.422 ± 117.357  ops/s
FloatVector256DotProduct.vectorUnrolled     1048576  thrpt    5  3734.246 ± 122.463  ops/s
FloatVector256DotProduct.vectorfma          1048576  thrpt    5  3804.485 ±  44.797  ops/s
FloatVector256DotProduct.vectorfmaUnrolled  1048576  thrpt    5  1158.018 ±  15.955  ops/s
JavaDocExample.simpleMultiply               1048576  thrpt    5  1914.794 ±  51.329  ops/s
JavaDocExample.simpleMultiplyUnrolled       1048576  thrpt    5  1405.345 ±  52.025  ops/s
JavaDocExample.vectorMultiply               1048576  thrpt    5  1832.133 ±  56.256  ops/s

(2) vectorfmaUnrolled 中的回歸(與 vectorfma 相比)是由眾所周知的破壞矢量框消除的內聯問題引起的:

Benchmark                                    (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vectorfma          1048576  thrpt    5  3804.485 ±  44.797  ops/s
FloatVector256DotProduct.vectorfmaUnrolled  1048576  thrpt    5  1158.018 ±  15.955  ops/s

19727   95    b        net.codingdemon.vectorization.FloatVector256DotProduct::vectorfmaUnrolled (228 bytes)
    ...
    @ 209   jdk.incubator.vector.FloatVector::add (9 bytes)   force inline by annotation
      @ 5   jdk.incubator.vector.FloatVector::lanewise (0 bytes)   virtual call
    @ 213   jdk.incubator.vector.FloatVector::add (9 bytes)   force inline by annotation
      @ 5   jdk.incubator.vector.FloatVector::lanewise (0 bytes)   virtual call
    @ 218   jdk.incubator.vector.FloatVector::add (9 bytes)   force inline by annotation
      @ 5   jdk.incubator.vector.FloatVector::lanewise (0 bytes)   virtual call
    ...

Benchmark                                                                     (size)   Mode  Cnt        Score        Error   Units
FloatVector256DotProduct.vectorfma                                           1048576  thrpt    5     3938.922 ±     97.041   ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm                       1048576  thrpt    5        0.111 ±      0.003    B/op

FloatVector256DotProduct.vectorfmaUnrolled                                   1048576  thrpt    5     2052.549 ±     68.859   ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm               1048576  thrpt    5  1573537.127 ±     22.886    B/op

在修復內聯之前,作為一種解決方法,具有較小數據輸入的預熱階段可以幫助:

Benchmark                                                       (size)   Mode  Cnt         Score        Error   Units
FloatVector256DotProduct.vectorfma                                 128  thrpt    5  54838734.769 ± 161477.746   ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm             128  thrpt    5        ≈ 10⁻⁵                 B/op

FloatVector256DotProduct.vectorfmaUnrolled                         128  thrpt    5  68993637.658 ± 359974.720   ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm     128  thrpt    5        ≈ 10⁻⁵                 B/op

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM