Forcing AVX intrinsics to use SSE instructions instead

Unfortunately I have an AMD Piledriver CPU, which seems to have problems with AVX instructions:

Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5-6 times slower than on the previous model (Bulldozer), and 8-9 times slower than two 128-bit writes.

In my own experience, I've found _mm256 intrinsics to be much slower than their 128-bit _mm counterparts, and I'm assuming it's because of the above.

I still want to code for the newest instruction set, AVX, while being able to test builds on my machine at a reasonable speed. Is there a way to force _mm256 intrinsics to use SSE instructions instead? I'm using VS 2015.

If there is no easy way, what about a hard one: replace <immintrin.h> with a custom-made header containing my own definitions of the intrinsics, coded to use SSE? I'm not sure how plausible that is, and I'd prefer an easier way, if one exists, before going through that work.

Use Agner Fog's Vector Class Library (VCL) and add this to the command line in Visual Studio: -D__SSE4_2__ -D__XOP__.

Then use an AVX-sized vector such as Vec8f for eight floats. When you compile without AVX enabled, the VCL will use the file vectorf256e.h, which emulates AVX with two SSE registers. For example, Vec8f inherits from Vec256fe, which starts like this:

class Vec256fe {
protected:
    __m128 y0;                         // low half
    __m128 y1;                         // high half

If you compile with /arch:AVX -D__XOP__, the VCL will instead use the file vectorf256.h and a single AVX register. Your code then works for both AVX and SSE with only a compiler-switch change.

If you don't want to use XOP, don't add -D__XOP__.
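To make that concrete, here is a minimal sketch (the function and array names are my own illustration, not part of the VCL) that compiles unchanged either way; only the compiler switches decide whether Vec8f maps to one AVX register or to two SSE registers:

#include "vectorclass.h"

// add two float arrays; n is assumed to be a multiple of 8 for simplicity
void add_arrays(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        Vec8f va, vb;
        va.load(a + i);        // one 256-bit load, or two 128-bit loads when emulated
        vb.load(b + i);
        Vec8f vc = va + vb;    // vaddps ymm, or two addps xmm
        vc.store(c + i);
    }
}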


As Peter Cordes pointed out in his answer, if your goal is only to avoid 256-bit loads/stores, then you may still want VEX-encoded instructions (though it's not clear this will make a difference except in some special cases). You can do that with the vector class like this:

Vec8f a;
Vec4f lo = a.get_low();  // a is a Vec8f type
Vec4f hi = a.get_high();
lo.store(&b[0]);         // b is a float array
hi.store(&b[4]);

then compile with /arch:AVX -D__XOP__.

Another option would be a single source file that uses Vecnf:

//foo.cpp
#include "vectorclass.h"
#if SIMDWIDTH == 4
typedef Vec4f Vecnf;
#else
typedef Vec8f Vecnf;
#endif  

and compile like this:

cl /O2 /DSIMDWIDTH=4                     foo.cpp /Fefoo_sse
cl /O2 /DSIMDWIDTH=4 /arch:AVX /D__XOP__ foo.cpp /Fefoo_avx128
cl /O2 /DSIMDWIDTH=8 /arch:AVX           foo.cpp /Fefoo_avx256

This creates three executables from one source file. Instead of linking them, you could compile with /c and then make a CPU dispatcher. I used XOP with avx128 because I don't think there is a good reason to use AVX-128 except on AMD.
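A dispatcher could look roughly like this; this is only a sketch, and the per-ISA entry points foo_sse, foo_avx128 and foo_avx256 are hypothetical names (when compiling one foo.cpp three times you would have to give each build a distinct function name, e.g. via another /D define):

#include <intrin.h>                 // __cpuid, _xgetbv (MSVC)

void foo_sse(float*, int);          // built with /DSIMDWIDTH=4
void foo_avx128(float*, int);       // built with /DSIMDWIDTH=4 /arch:AVX /D__XOP__
void foo_avx256(float*, int);       // built with /DSIMDWIDTH=8 /arch:AVX

// pick an implementation once at startup based on CPUID/XGETBV
void (*choose_foo())(float*, int) {
    int info[4];
    __cpuid(info, 1);
    bool osxsave = (info[2] >> 27) & 1;   // OS saves YMM state on context switch
    bool avx     = (info[2] >> 28) & 1;
    if (avx && osxsave && (_xgetbv(0) & 6) == 6)
        return foo_avx256;                // could prefer foo_avx128 on Bulldozer-family
    return foo_sse;
}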

You don't want to use SSE instructions. What you want is for 256b stores to be done as two separate 128b stores, still with VEX-coded 128b instructions, i.e. 128b AVX vmovups.


gcc has -mavx256-split-unaligned-load and ...-store options (enabled as part of -march=sandybridge for example, and presumably also for the Bulldozer family; -march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
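For example (foo.cpp stands in for your own source file):

g++ -O3 -march=bdver2 foo.cpp          # split options enabled by the Piledriver tuning
g++ -O3 -mavx -mavx256-split-unaligned-load -mavx256-split-unaligned-store foo.cpp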


You could override the normal 256b store intrinsic with a macro like

// define the override *after* including <immintrin.h>, so the header's own
// declaration of the intrinsic isn't macro-expanded
#include <immintrin.h>

// maybe enable this for all BD family CPUs?
#if defined(__bdver2) || defined(PILEDRIVER) || defined(SPLIT_256b_STORES)
   #define _mm256_storeu_ps(addr, data) do{ \
      _mm_storeu_ps( ((float*)(addr)) + 0, _mm256_extractf128_ps((data),0)); \
      _mm_storeu_ps( ((float*)(addr)) + 4, _mm256_extractf128_ps((data),1)); \
   }while(0)
#endif
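With the override in place, existing AVX code doesn't need to change. For example (scale8 is an illustrative name, not from the original answer):

// multiply eight floats by s; the store is split into two 128b vmovups by the macro
void scale8(float* p, float s) {
    __m256 v = _mm256_loadu_ps(p);
    v = _mm256_mul_ps(v, _mm256_set1_ps(s));
    _mm256_storeu_ps(p, v);   // expands to the two-store do{...}while(0) above
}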

gcc defines __bdver2 (Bulldozer version 2) for Piledriver (-march=bdver2).

You could do the same for the (aligned) _mm256_store_ps, or just always use the unaligned intrinsic.
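The aligned variant would look the same, just with the aligned 128b store (again a sketch; each 16B half of a 32B-aligned address is still 16B-aligned, so _mm_store_ps is safe):

#if defined(__bdver2) || defined(PILEDRIVER) || defined(SPLIT_256b_STORES)
   #define _mm256_store_ps(addr, data) do{ \
      _mm_store_ps( ((float*)(addr)) + 0, _mm256_extractf128_ps((data),0)); \
      _mm_store_ps( ((float*)(addr)) + 4, _mm256_extractf128_ps((data),1)); \
   }while(0)
#endif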

Compilers optimize _mm256_extractf128_ps((data), 0) to a simple cast, i.e. it should compile to just

vmovups       [rdi], xmm0         ; if data is in ymm0 and addr is in rdi
vextractf128  [rdi+16], ymm0, 1

However, testing on Godbolt shows that gcc and clang are dumb here: they extract to a register and then store. ICC correctly generates the two-instruction sequence.
