
Will Intel -O3 convert pairs of __m256d instructions into __m512d?

Will code written for 256-bit vector registers be compiled to use 512-bit instructions by the (2019) Intel compiler at the -O3 optimization level?

E.g., will operations on two __m256d objects either be converted to the same number of operations on masked __m512d objects, or be grouped so as to make the most use of the wider register, in the best case halving the total number of operations?

arch: Knights Landing

Unfortunately, no: code written with AVX/AVX2 intrinsics is not rewritten by ICC to use AVX-512 (tested with both ICC 2019 and ICC 2021). There is no instruction fusing. Here is an example (see it on GodBolt).

    #include <x86intrin.h>

    void compute(double* restrict data, int size)
    {
        __m256d cst1 = _mm256_set1_pd(23.42);
        __m256d cst2 = _mm256_set1_pd(815.0);
        __m256d v1 = _mm256_load_pd(data);
        __m256d v2 = _mm256_load_pd(data+4);
        __m256d v3 = _mm256_load_pd(data+8);
        __m256d v4 = _mm256_load_pd(data+12);
        v1 = _mm256_fmadd_pd(v1, cst1, cst2);
        v2 = _mm256_fmadd_pd(v2, cst1, cst2);
        v3 = _mm256_fmadd_pd(v3, cst1, cst2);
        v4 = _mm256_fmadd_pd(v4, cst1, cst2);
        _mm256_store_pd(data, v1);
        _mm256_store_pd(data+4, v2);
        _mm256_store_pd(data+8, v3);
        _mm256_store_pd(data+12, v4);
    }

Generated code:

    compute:
            vmovupd     ymm0, YMMWORD PTR .L_2il0floatpacket.0[rip]    #5.20
            vmovupd     ymm1, YMMWORD PTR .L_2il0floatpacket.1[rip]    #6.20
            vmovupd     ymm2, YMMWORD PTR [rdi]                        #7.33
            vmovupd     ymm3, YMMWORD PTR [32+rdi]                     #8.33
            vmovupd     ymm4, YMMWORD PTR [64+rdi]                     #9.33
            vmovupd     ymm5, YMMWORD PTR [96+rdi]                     #10.33
            vfmadd213pd ymm2, ymm0, ymm1                               #11.10
            vfmadd213pd ymm3, ymm0, ymm1                               #12.10
            vfmadd213pd ymm4, ymm0, ymm1                               #13.10
            vfmadd213pd ymm5, ymm0, ymm1                               #14.10
            vmovupd     YMMWORD PTR [rdi], ymm2                        #15.21
            vmovupd     YMMWORD PTR [32+rdi], ymm3                     #16.21
            vmovupd     YMMWORD PTR [64+rdi], ymm4                     #17.21
            vmovupd     YMMWORD PTR [96+rdi], ymm5                     #18.21
            vzeroupper                                                 #19.1
            ret                                                        #19.1

The same code is generated by both versions of ICC.

Note that using AVX-512 does not always speed up your code by a factor of two. For example, Skylake-SP (server processors) has two 256-bit AVX/AVX2 SIMD units that can be fused to execute AVX-512 instructions, but this fusion does not improve throughput (assuming the SIMD units are the bottleneck). However, some Skylake-SP processors also have an additional dedicated 512-bit SIMD unit that does not execute AVX/AVX2 instructions. On those processors, AVX-512 can make your code twice as fast.
