
Matrix-Vector and Matrix-Matrix multiplication using SSE

I need to write matrix-vector and matrix-matrix multiplication functions, but I cannot wrap my head around the SSE intrinsics.

The dimensions of the matrices and vectors are always multiples of 4.

I managed to write a vector-vector multiplication function that looks like this:

void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size)
{
    int i;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_n = (__m128*)n;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < size / 4; ++i)
        p_result[i] = _mm_mul_ps(p_m[i], p_n[i]);

    // print the result
    for (int i = 0; i < size; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

and now I'm trying to implement matrix-vector multiplication.

Here's what I have so far:

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; i += 4)
    {
        __m128 tmp = _mm_load_ps(&result[i]);
        __m128 p_m_tmp = _mm_load_ps(&m[i]);

        tmp = _mm_add_ps(tmp, _mm_mul_ps(tmp, p_m_tmp));
        _mm_store_ps(&result[i], tmp);

        // another for loop here? 
    }

    // print the result
    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

This function looks completely wrong. I mean, not only does it not work correctly, it also seems that I'm moving in the wrong direction entirely.


Could anyone help me with implementing vector-matrix and matrix-matrix multiplication? I'd really appreciate some example code and a very detailed explanation.

Update

Here's my attempt number 2:

It fails with an access reading violation exception, but still feels closer:

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; ++i)
    {
        p_result[i] = _mm_mul_ps(_mm_load_ps(&m[i]), _mm_load_ps1(&v[i]));
    }

    // print the result
    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

Update 2

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;
    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; ++i)
    {
        for (j = 0; j < vector_dims * vector_dims / 4; ++j)
        {
            p_result[i] = _mm_mul_ps(p_v[i], p_m[j]);
        }
    }

    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
    cout << endl;
}

Without any tricks or anything, a matrix-vector multiplication is just a bunch of dot products between the vector and the rows of the matrix. Your code doesn't really have that structure. Writing it literally as dot products (not tested):

for (int row = 0; row < nrows; ++row) {
    __m128 acc = _mm_setzero_ps();
    // I'm just going to assume the number of columns is a multiple of 4
    for (int col = 0; col < ncols; col += 4) {
        __m128 vec = _mm_load_ps(&v[col]);
        // don't forget it's a matrix, do 2d addressing
        __m128 mat = _mm_load_ps(&m[col + ncols * row]);
        acc = _mm_add_ps(acc, _mm_mul_ps(mat, vec));
    }
    // now we have 4 floats in acc and they have to be summed
    // can use two horizontal adds for this, they kind of suck but this
    // isn't the inner loop anyway.
    acc = _mm_hadd_ps(acc, acc);
    acc = _mm_hadd_ps(acc, acc);
    // store result, which is a single float
    _mm_store_ss(&result[row], acc);
}

There are some obvious tricks, such as processing several rows at once, reusing the load from the vector, and creating several independent dependency chains so you can make better use of the throughput (see below). Another really simple trick is using FMA for the mul/add combo; support was not that widespread in 2015, but it is fairly widespread now.

You can build matrix-matrix multiplication from this (if you change where the result goes), but that is not optimal (see further below).


Taking four rows at once (not tested):

for (int row = 0; row < nrows; row += 4) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (int col = 0; col < ncols; col += 4) {
        __m128 vec = _mm_load_ps(&v[col]);
        __m128 mat0 = _mm_load_ps(&m[col + ncols * row]);
        __m128 mat1 = _mm_load_ps(&m[col + ncols * (row + 1)]);
        __m128 mat2 = _mm_load_ps(&m[col + ncols * (row + 2)]);
        __m128 mat3 = _mm_load_ps(&m[col + ncols * (row + 3)]);
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(mat0, vec));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(mat1, vec));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(mat2, vec));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(mat3, vec));
    }
    acc0 = _mm_hadd_ps(acc0, acc1);
    acc2 = _mm_hadd_ps(acc2, acc3);
    acc0 = _mm_hadd_ps(acc0, acc2);
    _mm_store_ps(&result[row], acc0);
}

There are now only 5 loads per 4 FMAs, versus 2 loads per FMA in the version that wasn't row-unrolled. There are also 4 independent FMAs (or mul/add pairs without FMA contraction); either way, this increases the potential for pipelined/simultaneous execution. You might actually want to unroll even more: Skylake, for example, can start 2 independent FMAs per cycle, and they take 4 cycles to complete, so to completely occupy both FMA units you need 8 independent FMAs. As a bonus, the 3 horizontal adds at the end work out relatively nicely for the horizontal summation.


The different data layout initially seems like a disadvantage: it's no longer possible to simply do vector loads from both the matrix and the vector and multiply them together (that would again multiply a tiny row vector of the first matrix by a tiny row vector of the second matrix, which is wrong). But full matrix-matrix multiplication can exploit the fact that it's essentially multiplying a matrix by lots of independent vectors; it's full of independent work to be done. The horizontal sums can easily be avoided, too. So it's actually even more convenient than matrix-vector multiplication.

The key is taking a little column vector from matrix A and a little row vector from matrix B and multiplying them out into a small matrix. That may sound reversed compared to what you're used to, but doing it this way works out better with SIMD because the computations stay independent and free of horizontal operations the whole time.

For example (not tested; assumes the matrix dimensions are divisible by the unroll factors, and requires x64, otherwise it runs out of registers):

for (size_t i = 0; i < mat1rows; i += 4) {
    for (size_t j = 0; j < mat2cols; j += 8) {
        float* mat1ptr = &mat1[i * mat1cols];
        __m128 sumA_1, sumB_1, sumA_2, sumB_2, sumA_3, sumB_3, sumA_4, sumB_4;
        sumA_1 = _mm_setzero_ps();
        sumB_1 = _mm_setzero_ps();
        sumA_2 = _mm_setzero_ps();
        sumB_2 = _mm_setzero_ps();
        sumA_3 = _mm_setzero_ps();
        sumB_3 = _mm_setzero_ps();
        sumA_4 = _mm_setzero_ps();
        sumB_4 = _mm_setzero_ps();

        size_t m2idx = j;
        for (size_t k = 0; k < mat2rows; ++k) {
            auto bc_mat1_1 = _mm_set1_ps(mat1ptr[0]);
            auto vecA_mat2 = _mm_load_ps(mat2 + m2idx);
            auto vecB_mat2 = _mm_load_ps(mat2 + m2idx + 4);
            sumA_1 = _mm_add_ps(_mm_mul_ps(bc_mat1_1, vecA_mat2), sumA_1);
            sumB_1 = _mm_add_ps(_mm_mul_ps(bc_mat1_1, vecB_mat2), sumB_1);
            auto bc_mat1_2 = _mm_set1_ps(mat1ptr[mat1cols]);
            sumA_2 = _mm_add_ps(_mm_mul_ps(bc_mat1_2, vecA_mat2), sumA_2);
            sumB_2 = _mm_add_ps(_mm_mul_ps(bc_mat1_2, vecB_mat2), sumB_2);
            auto bc_mat1_3 = _mm_set1_ps(mat1ptr[mat1cols * 2]);
            sumA_3 = _mm_add_ps(_mm_mul_ps(bc_mat1_3, vecA_mat2), sumA_3);
            sumB_3 = _mm_add_ps(_mm_mul_ps(bc_mat1_3, vecB_mat2), sumB_3);
            auto bc_mat1_4 = _mm_set1_ps(mat1ptr[mat1cols * 3]);
            sumA_4 = _mm_add_ps(_mm_mul_ps(bc_mat1_4, vecA_mat2), sumA_4);
            sumB_4 = _mm_add_ps(_mm_mul_ps(bc_mat1_4, vecB_mat2), sumB_4);
            m2idx += mat2cols;  // next row of mat2, same columns
            mat1ptr++;          // next column of mat1
        }
        }
        _mm_store_ps(&result[i * mat2cols + j], sumA_1);
        _mm_store_ps(&result[i * mat2cols + j + 4], sumB_1);
        _mm_store_ps(&result[(i + 1) * mat2cols + j], sumA_2);
        _mm_store_ps(&result[(i + 1) * mat2cols + j + 4], sumB_2);
        _mm_store_ps(&result[(i + 2) * mat2cols + j], sumA_3);
        _mm_store_ps(&result[(i + 2) * mat2cols + j + 4], sumB_3);
        _mm_store_ps(&result[(i + 3) * mat2cols + j], sumA_4);
        _mm_store_ps(&result[(i + 3) * mat2cols + j + 4], sumB_4);
    }
}

The point of that code is that it's easy to arrange the computation to be very SIMD-friendly, with lots of independent arithmetic to saturate the floating-point units, while at the same time using relatively few loads (which could otherwise become a bottleneck simply by there being too many of them, even setting aside that they might miss L1 cache).

You can even use this code as-is, but it's not competitive with Intel MKL, especially for medium or big matrices, where tiling is extremely important. It's easy to upgrade this to AVX. It's not suitable for tiny matrices at all; for example, to multiply two 4x4 matrices, see Efficient 4x4 matrix multiplication.
