
Calculating matrix product is much slower with SSE than with the straightforward algorithm

I want to multiply two matrices, once using the straightforward algorithm:

template <typename T>
void multiplicate_straight(T ** A, T ** B, T ** C, int sizeX)
{
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D,sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            for(int g = 0; g < sizeX; g++)
            {
                C[i][j] += A[i][g]*D[j][g];
            }
        }
    }
    FreeDynamicArray2D<T>(D);
}

and once using SSE functions. For this I created two functions:

template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{

    __m128d a;
    __m128d b;
    __m128d c;
#ifdef linux
    double A2[2], B2[2], C[2] __attribute__ ((aligned(16)));
#endif
#ifdef _WIN32
    __declspec(align(16)) double A2[2], B2[2], C[2];
#endif
    for(int i = 0; i < size; i+=2)
    {
        //std::cout << "In SSE_vectormult: i is: " << i << '\n';
        A2[0] = A[i];
        B2[0] = B[i];
        A2[1] = A[i+1];
        B2[1] = B[i+1];
        //std::cout << "Values from A and B written to A2 and B2\n";
        a = _mm_load_pd(A2);
        b = _mm_load_pd(B2);
        //std::cout << "Values converted to a and b\n";
        c = _mm_mul_pd(a,b);
        _mm_store_pd(C, c);
        A[i] = C[0];
        A[i+1] = C[1];
    };
}

and

template <typename T>
void multiplicate_SSE(T ** A, T ** B, T ** C, int sizeX)
{
//    std::cout << "Entered SSE-Function\n";
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    T * tmp = AllocateDynamicArray1D<T>(sizeX);
    T * tmp2 = AllocateDynamicArray1D<T>(sizeX);
    //std::cout << "Matrices allocated\n";
    transpose_matrix<T>(B, D,sizeX);
    //std::cout << "Matrix B transposed\n";
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            extract_row<T>(A,tmp, i, sizeX);
//            std::cout << "Row from A extracted\n";
            //print_vector(tmp, sizeX);
            extract_row<T>(D, tmp2, j, sizeX);
//            std::cout << "Row from D extracted\n";
            //print_vector(tmp2, sizeX);
            SSE_vectormult<T>(tmp, tmp2, sizeX);
//            std::cout << "Vectors multiplicated\n";
            //print_vector(tmp, sizeX);
            C[i][j] = add_vector(tmp, sizeX);
//            std::cout << "Written value to C\n";
//            std::cout << "j is " << j << " and i is " << i << '\n';
        }
    }
//    std::cout << "Loop finished\n";
    FreeDynamicArray2D<T>(D);
    //std::cout << "Freed D\n";
    //FreeDynamicArray1D<T>(tmp);????
//    std::cout << "Freed tmp\n";
    FreeDynamicArray1D<T>(tmp2);
//    std::cout << "Everything freed, returning\n";
}

But then I get several problems: on the one hand, when I try to free the tmp array in multiplicate_SSE() (marked with several question marks), I get the error "_BLOCK_TYPE_IS_VALID". I thought I might be freeing the same memory twice, so I commented that line out (but I suppose this gives me a memory leak?). On the other hand, when I compare the performance of both functions with identical matrices, the SSE function takes around four times longer than the straightforward method for two 1024x1024 matrices.
How can I rewrite my SSE function to get better performance (I have never worked with SSE before), and how can I fix my memory leak?
Thank you!

You have the right idea by taking the transpose in the scalar code but you don't want exactly the transpose when using SSE.

Let's stick to float (SGEMM). What you want to do with SSE is compute four dot products at once. You want C = A*B. Let's look at an 8x8 matrix. Let's assume B is:

(0   1  2  3) ( 4  5  6  7)
(8   9 10 11) (12 13 14 15) 
(16 17 18 19) (20 21 22 23)
(24 25 26 27) (28 29 30 31)
(32 33 34 35) (36 37 38 39)
(40 41 42 43) (44 45 46 47)
(48 49 50 51) (52 53 54 55)
(56 57 58 59) (60 61 62 63)

So with SSE you do:

C[0][0] C[0][1] C[0][2] C[0][3] = 
A[0][0]*(0 1 2 3) + A[0][1]*(8 9 10 11) + A[0][2]*(16 17 18 19)...+ A[0][7]*(56 57 58 59)
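
Written with intrinsics, that pattern for one row of C might look roughly like the sketch below (illustrative only and not part of the original answer; it loads straight from B, which is exactly the cache-unfriendly access discussed next):

#include <xmmintrin.h>

// Sketch: compute one row of C = A*B four columns at a time by broadcasting A[i][g]
// and multiplying it with four consecutive values of row g of B.
// Assumes row-major float matrices and sizeX a multiple of 4; names are illustrative.
void row_of_C(const float *A_row_i, float **B, float *C_row_i, int sizeX)
{
    for (int j = 0; j < sizeX; j += 4) {
        __m128 acc = _mm_setzero_ps();
        for (int g = 0; g < sizeX; g++) {
            __m128 b = _mm_loadu_ps(&B[g][j]);          // four values from row g of B
            __m128 a = _mm_set1_ps(A_row_i[g]);         // broadcast A[i][g]
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));    // accumulate four dot products
        }
        _mm_storeu_ps(&C_row_i[j], acc);                // C[i][j..j+3]
    }
}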

That gets you four dot products at once. The problem is that you have to move down a column in B, and those values are not in the same cache line. It would be better if each column of width four was contiguous in memory. So instead of taking the full element-by-element transpose, you transpose strips with a width of 4, like this:

(0  1  2  3)( 8  9 10 11)(16 17 18 19)(24 25 26 27)(32 33 34 35)(40 41 42 43)(48 49 50 51)(56 57 58 59)
(4  5  6  7)(12 13 14 15)(20 21 22 23)(28 29 30 31)(36 37 38 39)(44 45 46 47)(52 53 54 55)(60 61 62 63)

If you think of each of the four values in parentheses as one unit, this is equivalent to transposing an 8x2 matrix to a 2x8 matrix. Notice that the columns of width four of B are now contiguous in memory. This is far more cache friendly. For an 8x8 matrix this is not really an issue, but for a 1024x1024 matrix, for example, it is. See the code below for how to do this. For AVX you transpose strips of width 8 (which means you have nothing to do for an 8x8 matrix). For double precision the strip width is two with SSE and four with AVX.

This should be four times faster than the scalar code assuming the matrices fit in the cache. However, for large matrices this method is still going to be memory bound and so your SSE code may not be much faster than scalar code (but it should not be worse).

However, if you use loop tiling and rearrange the matrices in tiles that fit in the L2 cache, rather than operating on the whole matrix at once, matrix multiplication becomes computation bound and not memory bound, even for very large matrices that don't fit in the L3 cache. That's another topic.
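
Just to illustrate the loop structure of tiling (a scalar sketch only; the tile size is an assumed, machine-dependent value, and the rearrangement into tiled storage mentioned above is left out):

// Scalar sketch of loop tiling (blocking) for C = A*B with flat row-major arrays.
// TILE is an assumed block size; for brevity this assumes sizeX is a multiple of
// TILE and that C has been zeroed by the caller.
void matmul_tiled(const float *A, const float *B, float *C, int sizeX)
{
    const int TILE = 64;
    for (int i0 = 0; i0 < sizeX; i0 += TILE)
    for (int g0 = 0; g0 < sizeX; g0 += TILE)
    for (int j0 = 0; j0 < sizeX; j0 += TILE)
        // multiply the (i0,g0) block of A with the (g0,j0) block of B
        // and accumulate into the (i0,j0) block of C
        for (int i = i0; i < i0 + TILE; i++)
            for (int g = g0; g < g0 + TILE; g++)
                for (int j = j0; j < j0 + TILE; j++)
                    C[i*sizeX + j] += A[i*sizeX + g] * B[g*sizeX + j];
}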

Edit: here is some (untested) code to compare to your scalar code. I unrolled the loop by two. It is written with a small float4 helper class (four packed floats with load/store and overloaded operators, not shown here); the equivalent raw intrinsics are listed after the code.

void SGEMM_SSE(const float *A, const float *B, float *C, const int sizeX) {
    const int simd_width = 4;
    const int unroll = 2;
    const int strip_width = simd_width*unroll;
    float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    transpose_matrix_strip(B, D, sizeX, strip_width); //transpose B in strips of width eight
    for(int i = 0; i < sizeX; i++) {
        for(int j = 0; j < sizeX; j+=strip_width) {
            float4 out_v1 = 0; //broadcast (0,0,0,0)
            float4 out_v2 = 0;
            //now calculate eight dot products
            for(int g = 0; g < sizeX; g++) {
                //load eight values from D into two SSE registers
                float4 vec4_1, vec4_2;
                vec4_1.load(&D[j*sizeX + strip_width*g]);
                vec4_2.load(&D[j*sizeX + strip_width*g + simd_width]);
                out_v1 += A[i*sizeX + g]*vec4_1; //A is a flat array: A[i][g] is A[i*sizeX + g]
                out_v2 += A[i*sizeX + g]*vec4_2;
            }
            //store eight dot products into C
            out_v1.store(&C[i*sizeX + j]);
            out_v2.store(&C[i*sizeX + j + simd_width]);
        }
    }
    _mm_free(D);
}

void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width) {
    // Copies A into B so that each strip of strip_width columns of A is stored
    // contiguously: element (i, k + j) of A ends up at B[k*N + i*strip_width + j],
    // where k is the first column of the strip.
    //#pragma omp parallel for
    for(int n=0; n<N*N; n++) {
        int k = strip_width*(n/N/strip_width);
        int i = (n/strip_width)%N;
        int j = n%strip_width;
        B[n] = A[N*i + k + j];
    }
}

Notice that j increments by eight now. More unrolling may help. If you want to use intrinsics instead of a wrapper class you can use _mm_load_ps, _mm_store_ps, _mm_set1_ps (for the broadcasts, e.g. _mm_set1_ps(A[i*sizeX + g])), _mm_add_ps, and _mm_mul_ps. That's it.
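
For reference, a version of the same kernel written directly with those intrinsics might look like the untested sketch below (it assumes D has already been strip-transposed as above, sizeX is a multiple of eight, and all arrays are 16-byte aligned):

#include <xmmintrin.h>

void SGEMM_SSE_intrinsics(const float *A, const float *D, float *C, const int sizeX) {
    const int simd_width = 4;
    const int strip_width = 8; // D is B strip-transposed with width eight
    for(int i = 0; i < sizeX; i++) {
        for(int j = 0; j < sizeX; j += strip_width) {
            __m128 out_v1 = _mm_setzero_ps();
            __m128 out_v2 = _mm_setzero_ps();
            for(int g = 0; g < sizeX; g++) {
                __m128 a4 = _mm_set1_ps(A[i*sizeX + g]);                           // broadcast A[i][g]
                __m128 b1 = _mm_load_ps(&D[j*sizeX + strip_width*g]);              // four values of row g
                __m128 b2 = _mm_load_ps(&D[j*sizeX + strip_width*g + simd_width]); // next four values
                out_v1 = _mm_add_ps(out_v1, _mm_mul_ps(a4, b1));
                out_v2 = _mm_add_ps(out_v2, _mm_mul_ps(a4, b2));
            }
            _mm_store_ps(&C[i*sizeX + j], out_v1);              // eight dot products into C
            _mm_store_ps(&C[i*sizeX + j + simd_width], out_v2);
        }
    }
}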

I believe this should do the same thing as the first loop with SSE, assuming sizeX is a multiple of two and the memory is 16-byte aligned.

You may gain a bit more performance by unrolling the loop and using multiple temporary variables which you add together at the end (sketched after the code below). You could also try AVX and the new fused multiply-add (FMA) instructions.

template <typename T>
void multiplicate_SSE2(T ** A, T ** B, T ** C, int sizeX)
{
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D,sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            __m128d temp = _mm_setzero_pd();
            for(int g = 0; g < sizeX; g += 2)
            {
                __m128d a = _mm_load_pd(&A[i][g]);
                __m128d b = _mm_load_pd(&D[j][g]);
                temp = _mm_add_pd(temp, _mm_mul_pd(a,b));
            }
            // Add top and bottom half of temp together
            temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1));
            _mm_store_sd(&C[i][j], temp); // Store one value
        }
    }
    FreeDynamicArray2D<T>(D);
}
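
As a sketch of the unrolling suggestion above (untested; the same alignment assumption, sizeX now a multiple of four, the intrinsics only make sense for T = double, and the helper functions are the ones from the question), the version with two independent accumulators that are combined after the loop could look like this:

template <typename T>
void multiplicate_SSE2_unrolled(T ** A, T ** B, T ** C, int sizeX)
{
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D, sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            __m128d temp1 = _mm_setzero_pd();
            __m128d temp2 = _mm_setzero_pd();
            for(int g = 0; g < sizeX; g += 4)
            {
                temp1 = _mm_add_pd(temp1, _mm_mul_pd(_mm_load_pd(&A[i][g]),   _mm_load_pd(&D[j][g])));
                temp2 = _mm_add_pd(temp2, _mm_mul_pd(_mm_load_pd(&A[i][g+2]), _mm_load_pd(&D[j][g+2])));
            }
            __m128d temp = _mm_add_pd(temp1, temp2);                // combine the two accumulators
            temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1)); // horizontal add
            _mm_store_sd(&C[i][j], temp);                           // store one value
        }
    }
    FreeDynamicArray2D<T>(D);
}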
