Understanding SSE3 matrix multiplication optimization

Question

With reference to http://blogs.msdn.com/b/xiangfan/archive/2009/04/28/optimize-your-code-matrix-multiplication.aspx :

template<>
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();

            for (int k = 0; k < size; k += 4) {
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c);
            c = _mm_hadd_ps(c, c);
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

Why are there two more _mm_hadd_ps(c, c) calls after the innermost for loop? To verify my understanding: this code loads 4 floats from m1 and another 4 from m2, then multiplies them, giving 4 floats (__m128). It then adds them into c (at this point, c is still 4 floats?). Then, after the for loop, the result is hadd-ed twice? What does that do?

My slightly rewritten code appears to produce the wrong result:

long long start, end;
__m128 v1, v2, vMul, vRes;
vRes = _mm_setzero_ps();

start = wall_clock_time();
transpose_matrix(m2);
for (int i = 0; i < SIZE; i++) {
    for (int j = 0; j < SIZE; j++) {
        float tmp = 0;
        for (int k = 0; k < SIZE; k+=4) {
            v1 = _mm_load_ps(&m1[i][k]);
            v2 = _mm_load_ps(&m2[j][k]);
            vMul = _mm_mul_ps(v1, v2);

            vRes = _mm_add_ps(vRes, vMul);
        }
        vRes = _mm_hadd_ps(vRes, vRes);
        _mm_store_ss(&result[i][j], vRes);
    }
}
end = wall_clock_time();
fprintf(stderr, "Optimized Matrix multiplication took %1.2f seconds\n", ((float)(end - start))/1000000000);

// reverse the transposition
transpose_matrix(m2);

Answer 1

haddps doesn't sum all four elements of a vector. Two haddps instructions are needed to get the full horizontal sum.

If we number the elements of the vector {c0,c1,c2,c3}, the first haddps produces {c0+c1, c2+c3, c0+c1, c2+c3}. The second produces {c0+c1+c2+c3, <same thing in the other lanes>}.
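
To make the lane arithmetic concrete, here is a minimal sketch that models haddps in plain C++ rather than calling the intrinsic, so it runs without SSE3 support. The names Vec4, hadd_model and horizontal_sum_model are made up for this illustration; the only thing taken from the instruction itself is the lane layout {a0+a1, a2+a3, b0+b1, b2+b3}.

```cpp
#include <array>

// Hypothetical stand-in for __m128: four float lanes, lane 0 first.
using Vec4 = std::array<float, 4>;

// Scalar model of haddps / _mm_hadd_ps(a, b):
// the low two lanes hold the pairwise sums of a,
// the high two lanes hold the pairwise sums of b.
Vec4 hadd_model(const Vec4& a, const Vec4& b) {
    return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Full horizontal sum: two hadd steps, then read lane 0,
// which is the lane _mm_store_ss writes to memory.
float horizontal_sum_model(const Vec4& c) {
    Vec4 once = hadd_model(c, c);         // {c0+c1, c2+c3, c0+c1, c2+c3}
    Vec4 twice = hadd_model(once, once);  // every lane = c0+c1+c2+c3
    return twice[0];
}
```

With c = {1, 2, 3, 4}, one hadd_model(c, c) leaves 3 in lane 0 (only c0+c1), and the second step is what brings lane 0 up to 10. Applied to the rewritten loop in the question, this suggests a second _mm_hadd_ps(vRes, vRes) before the _mm_store_ss. Separately, vRes in that code is zeroed only once, outside the loops, so it would probably also need to be reset with _mm_setzero_ps() at the start of each (i, j) iteration, as the blog's version does with c.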
