
Why does __m256 instead of 'float' give more than x8 performance?

Why am I getting such a huge speedup (x16) by using the __m256 datatype? 8 floats are processed at a time, so I would expect to see only an x8 speedup.

My CPU is a 4-core Devil's Canyon i7 (with hyperthreading). I'm compiling with Visual Studio 2017 in Release mode, with -O2 optimizations turned on.

The fast version takes 0.000151 seconds on a 400x400 matrix:

//make this matrix only keep the signs of its entries
inline void to_signs() {

    __m256 *i = reinterpret_cast<__m256*>(_arrays);
    __m256 *end = reinterpret_cast<__m256*>(_arrays + arraysSize());

    __m256 maskPlus = _mm256_set1_ps(1.f);
    __m256 maskMin =  _mm256_set1_ps(-1.f);

    //process the main portion of the array.  NOTICE: size might not be divisible by 8:
    while(true){
        ++i;
        if(i > end){  break; }

        __m256 *prev_i = i-1;
        *prev_i = _mm256_min_ps(*prev_i, maskPlus);
        *prev_i = _mm256_max_ps(*prev_i, maskMin);
    }

    //process the few remaining numbers, at the end of the array:
    i--;
    for(float *j=(float*)i; j<_arrays+arraysSize(); ++j){
        //taken from here:http://www.musicdsp.org/showone.php?id=249
        // mask sign bit in f, set it in r if necessary:
        float r = 1.0f;
        (int&)r |= ((int&)(*j) & 0x80000000);//according to author, can end up either -1 or 1 if zero.
        *j = r;
    }
}

The older version runs in 0.002416 seconds:

inline void to_signs_slow() {
    size_t size = arraysSize();

    for (size_t i = 0; i<size; ++i) {
        //taken from here:http://www.musicdsp.org/showone.php?id=249
        // mask sign bit in f, set it in r if necessary:

        float r = 1.0f;
        (int&)r |= ((int&)_arrays[i] & 0x80000000);//according to author, can end up either -1 or 1 if zero.
        _arrays[i] = r;
    }
}

Is it secretly using 2 cores, so this benefit will vanish once I start using multithreading?

Edit:

On a larger matrix, of size (10e6)x(4e4), I am getting 3 and 14 seconds on average. So a mere x4 speedup, not even x8. This is probably due to memory bandwidth, and things not fitting in cache.

Still, my question is about the pleasant x16 speedup surprise :)

Your scalar version looks horrible (with reference-casting for type-punning), and probably compiles to really inefficient asm that's a lot slower than copying the sign bit of each 32-bit element into the bit-pattern for 1.0f. It should only take an integer AND and an OR per element to do it scalar (if MSVC failed to auto-vectorize that for you), but I wouldn't be surprised if the compiler is copying it to an XMM register or something.
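As a side note, a well-defined way to do that scalar bit-trick is to type-pun through memcpy (which compilers optimize away) instead of casting references. A minimal sketch; `sign_of` is a hypothetical helper name, not part of the question's code:

```cpp
#include <cstring>
#include <cstdint>

// Copy the sign bit of x onto the bit-pattern of 1.0f using
// memcpy-based type punning (well-defined, unlike (int&)x casts).
inline float sign_of(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);             // read the float's bit pattern
    uint32_t r = 0x3f800000u | (bits & 0x80000000u); // 1.0f with x's sign bit applied
    float result;
    std::memcpy(&result, &r, sizeof result);
    return result;
}
```

Compilers see through the memcpy and emit just the integer AND/OR, so this is also the form that auto-vectorizes most readily.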


Your first manually-vectorized version doesn't even do the same work, though: it just masks away all the non-sign bits to leave -0.0f or +0.0f. So it will compile to one vandps ymm0, ymm7, [rdi] and one SIMD store with vmovups [rdi], ymm0, plus some loop overhead.

Not that adding an _mm256_or_ps with set1(1.0f) would slow it down any; you'd still bottleneck on cache bandwidth or 1-per-clock store throughput.


Then you edited it to a version that clamps into the -1.0f .. +1.0f range, leaving inputs with magnitude less than 1.0 unmodified. That's not going to be slower than two bitwise ops, except that Haswell (Devil's Canyon) runs FP booleans only on port 5, vs. actual FP math on port 0 or port 1.
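Note the behavioral difference: the clamp passes small values through unchanged instead of snapping them to ±1. A scalar sketch of the min/max pair (`clamp_pm1` is a hypothetical helper name):

```cpp
#include <algorithm>

// Clamp x into [-1.0f, +1.0f] — the scalar equivalent of the
// _mm256_min_ps / _mm256_max_ps pair in the fast version.
inline float clamp_pm1(float x) {
    return std::max(-1.0f, std::min(x, 1.0f));
}
```

So e.g. 0.25f stays 0.25f under the clamp, whereas the sign-copying version would turn it into 1.0f.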

Especially if you're not doing anything else with your floats, you'll actually want to use _si256 intrinsics to use just AVX2 integer instructions on them, for more speed on Haswell. (But then your code can't run without AVX2.)

On Skylake and newer, FP booleans can use all 3 vector ALU ports. ( https://agner.org/optimize/ for instruction tables and uarch guide.)

Your code should look something like this (assuming the size is a multiple of 8; handle the last few elements separately, as in your question):

// outside the loop if you want
const __m256i ones = _mm256_castps_si256(_mm256_set1_ps(1.0f));

// assumes arraysSize() is a multiple of 8
for (float *p = _arrays; p < _arrays + arraysSize(); p += 8) {
    __m256i floats  = _mm256_load_si256((const __m256i*)p);
    __m256i signs   = _mm256_and_si256(floats, _mm256_set1_epi32(0x80000000));
    __m256i applied = _mm256_or_si256(signs, ones);
    _mm256_store_si256((__m256i*)p, applied);
}
