Intel assembly vs Intrinsics, AVX

I have a simple vector-vector addition algorithm (c = a + b * lambda) written in Intel assembly, using AVX instructions. Here is my code:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Dense to dense
;; Uses cache
;; AVX
;; Without tolerances
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

global _denseToDenseAddAVX_cache_64_linux
_denseToDenseAddAVX_cache_64_linux:

push    rbp
mov     rbp, rsp

; rdi: address1
; rsi: address2
; rdx: address3
; rcx: count
; xmm0: lambda

mov     rax, rcx
shr     rcx, 3
and     rax, 0x07

vzeroupper

vmovupd  ymm5, [abs_mask]

sub     rsp, 8
vmovlpd  [rbp - 8], xmm0
vbroadcastsd    ymm7, [rbp - 8]
vmovapd     ymm6, ymm7

cmp     rcx, 0
je      after_loop_denseToDenseAddAVX_cache_64_linux

start_denseToDenseAddAVX_cache_64_linux:

vmovapd  ymm0, [rdi] ; a
vmovapd  ymm1, ymm7
vmulpd   ymm1, [rsi] ; b
vaddpd   ymm0, ymm1  ; ymm0 = c = a + b * lambda
vmovapd  [rdx], ymm0

vmovapd  ymm2, [rdi + 32] ; a
vmovapd  ymm3, ymm6
vmulpd   ymm3, [rsi + 32] ; b
vaddpd   ymm2, ymm3  ; ymm2 = c = a + b * lambda
vmovapd  [rdx + 32], ymm2

add     rdi, 64
add     rsi, 64
add     rdx, 64

loop    start_denseToDenseAddAVX_cache_64_linux

after_loop_denseToDenseAddAVX_cache_64_linux:

cmp     rax, 0
je      end_denseToDenseAddAVX_cache_64_linux

mov     rcx, rax

last_loop_denseToDenseAddAVX_cache_64_linux:

vmovlpd  xmm0, [rdi] ; a
vmovapd  xmm1, xmm7
vmulsd   xmm1, [rsi] ; b
vaddsd   xmm0, xmm1  ; xmm0 = c = a + b * lambda
vmovlpd  [rdx], xmm0

add     rdi, 8
add     rsi, 8
add     rdx, 8

loop    last_loop_denseToDenseAddAVX_cache_64_linux

end_denseToDenseAddAVX_cache_64_linux:

mov     rsp, rbp
pop     rbp
ret

People often suggest that I use Intel intrinsics because they are much better and safer. So I've implemented the same algorithm like this:

#include <immintrin.h>
#include <stddef.h>

void denseToDenseAddAVX_cache(const double * __restrict__ a,
                              const double * __restrict__ b, 
                              double * __restrict__ c, 
                              size_t count, double lambda) {
    const size_t firstCount = count / 8;
    const size_t rem1 = count % 8;
    int i;
    __m256d mul = _mm256_broadcast_sd(&lambda);
    for (i = 0; i < firstCount; i++) {
        // c = a + b * lambda
        __m256d dataA1 = _mm256_load_pd(&a[i * 8]);
        __m256d dataC1 = _mm256_add_pd(dataA1, _mm256_mul_pd(_mm256_load_pd(&b[i * 8]), mul  ));
        _mm256_store_pd(&c[i * 8], dataC1);

        __m256d dataA2 = _mm256_load_pd(&a[i * 8 + 4]);
        __m256d dataC2 = _mm256_add_pd(dataA2, _mm256_mul_pd(_mm256_load_pd(&b[i * 8 + 4]), mul  ));
        _mm256_store_pd(&c[i * 8 + 4], dataC2);
    }
    const size_t secondCount = rem1 / 4;
    const size_t rem2 = rem1 % 4;
    if (secondCount) {
        __m256d dataA = _mm256_load_pd(&a[i * 8]);
        __m256d dataC = _mm256_add_pd(dataA, _mm256_mul_pd(_mm256_load_pd(&b[i * 8]), mul  ));
        _mm256_store_pd(&c[i * 8], dataC);
        i += 4;
    }
    for (; i < count; i++) {
        c[i] = a[i] + b[i] * lambda;
    }
}

My problem is that the assembly version is twice as fast as the intrinsics one. What is the problem with the C++ version?

A few things.

  1. I think this is the most important one. The assembly code uses pointer arithmetic; your C++ code doesn't, it computes indices first and then takes addresses. Compilers often optimize indexing into pointer math, but that's not reliable, so you'd better use the same pointer math in your C++. What's even worse, an expression like &a[i * 8 + 4] requires multiple integer instructions: the address in bytes is a + i*64 + 32, and x86 addressing modes can only scale an index register for free by factors of 2, 4, or 8, so the compiler has to emit a left shift followed by an add just to compute the address. This issue doubles the instruction count in the body of the loop (see the sketch after this list).

  2. The C++ code uses a signed 32-bit integer for the loop counter, while the assembly uses unsigned 64-bit integers. For performance-critical code it's often a good idea to use size_t for loop counters in C++. BTW, if you had enabled the “warnings as errors” setting in your C++ compiler, it would have refused to compile, saying something like “signed/unsigned mismatch”.

  3. You have redundant loads in the C++ version. CPUs can do math plus one memory load in a single instruction. To get the same code the assembly has, don't use _mm256_load_pd; instead cast the pointers from const double * to const __m256d * and dereference them, so the compiler can fold the loads into the arithmetic instructions.
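To make point 1 concrete, here is a minimal sketch of just the two load/compute styles; the helper names are made up for the comparison, and the exact instruction sequences depend on the compiler. In the indexed form the byte address a + i*64 + 32 has to be rebuilt from i on every iteration, while in the pointer form the load address is simply the register holding a:

    #include <immintrin.h>
    #include <cstddef>

    // Indexed style, as in the question's loop: the compiler has to compute
    // the byte offset i*64 + 32 (shift + add) before each load.
    static inline __m256d bodyIndexed(const double *a, const double *b,
                                      std::size_t i, __m256d mul) {
        return _mm256_add_pd(_mm256_load_pd(&a[i * 8 + 4]),
                             _mm256_mul_pd(_mm256_load_pd(&b[i * 8 + 4]), mul));
    }

    // Pointer style, as in the assembly: a and b already point at the right
    // elements, so the loads need no extra address arithmetic.
    static inline __m256d bodyPointer(const double *a, const double *b,
                                      __m256d mul) {
        return _mm256_add_pd(_mm256_load_pd(a),
                             _mm256_mul_pd(_mm256_load_pd(b), mul));
    }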

Here's a slightly simplified example:

#include <immintrin.h>
#include <assert.h>
#include <stddef.h>

void denseToDenseAddAVX( const double *a, const double *b, double *c, size_t count, double lambda )
{
    assert( 0 == (size_t)( a ) % 32 );
    assert( 0 == (size_t)( b ) % 32 );
    assert( 0 == (size_t)( c ) % 32 );

    const double* const aEnd = a + count;
    const double* const aEndAligned = a + ( ( count / 4 ) * 4 );
    const __m256d mul = _mm256_set1_pd( lambda );
    while( a < aEndAligned )
    {
        const __m256d* const av = ( const __m256d* )a;
        const __m256d* const bv = ( const __m256d* )b;
        const __m256d cv = _mm256_add_pd( *av, _mm256_mul_pd( *bv, mul ) );
        _mm256_store_pd( c, cv );
        a += 4;
        b += 4;
        c += 4;
    }
    while( a < aEnd )
    {
        *c = ( *a ) + ( *b ) * lambda;
        a++;
        b++;
        c++;
    }
}
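If it helps, here is a usage sketch for the function above. It assumes the function is visible in the same translation unit; the buffer size and variable names are made up, and _mm_malloc / _mm_free (which come with the intrinsics headers on GCC, Clang and MSVC) are just one way to get the 32-byte alignment the asserts require:

    #include <immintrin.h>
    #include <cstdio>
    #include <cstddef>

    int main() {
        // Deliberately not a multiple of 4, so the scalar tail loop runs too.
        const std::size_t count = 1000003;
        double *a = (double *)_mm_malloc(count * sizeof(double), 32);
        double *b = (double *)_mm_malloc(count * sizeof(double), 32);
        double *c = (double *)_mm_malloc(count * sizeof(double), 32);

        for (std::size_t i = 0; i < count; i++) { a[i] = 1.0; b[i] = 2.0; }

        denseToDenseAddAVX(a, b, c, count, 0.5);      // c[i] = a[i] + b[i] * 0.5

        std::printf("%f %f\n", c[0], c[count - 1]);   // both should print 2.000000

        _mm_free(a);
        _mm_free(b);
        _mm_free(c);
        return 0;
    }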
