Matrix multiplication using SSE intrinsics
I am trying to do matrix multiplication using SSE. I have written a simple program for 4x4 matrices. Everything seems fine, but when I print the result, it contains garbage values. Please help me figure out the problem(s). Secondly, the program stops working when I free the memory, so it does not end properly.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <float.h>
#include <xmmintrin.h>
void main() {
    float **a, **b, **c;
    int a_r = 4, a_c = 4, b_c = 4, b_r = 4;
    int i, j, k;
    /* allocate memory for matrix one */
    a = (float **)malloc(sizeof(float) * a_r);
    for (i = 0; i < a_c; i++) {
        a[i] = (float *)malloc(sizeof(float) * a_c);
    }
    /* allocate memory for matrix two */
    b = (float **)malloc(sizeof(float) * b_r);
    for (i = 0; i < b_c; i++) {
        b[i] = (float *)malloc(sizeof(float) * b_c);
    }
    /* allocate memory for sum matrix */
    c = (float **)malloc(sizeof(float) * a_r);
    for (i = 0; i < b_c; i++) {
        c[i] = (float *)malloc(sizeof(float) * b_c);
    }
    printf("Initializing matrices...\n");
    //initializing first matrix
    for (i = 0; i < a_r; i++) {
        for (j = 0; j < a_c; j++) {
            a[i][j] = 2;
        }
    }
    // initializing second matrix
    for (i = 0; i < b_r; i++) {
        for (j = 0; j < b_c; j++) {
            b[i][j] = 2;
        }
    }
    /* initialize product matrix */
    for (i = 0; i < a_r; i++) {
        for (j = 0; j < b_c; j++) {
            c[i][j] = 0;
        }
    }
    int count = 0;
    /* multiply matrix one and matrix two */
    for (i = 0; i < a_r; i++) {
        for (j = 0; j < a_c; j++) {
            count = 0;
            __m128 result = _mm_setzero_ps();
            for (k = 0; k < 4; k += 4) {
                __m128 row1 = _mm_loadu_ps(&a[i][k]);
                __m128 row2 = _mm_loadu_ps(&b[k][j]);
                result = _mm_mul_ps(row1, row2);
                for (int t = 1; t < 4; t++) {
                    __m128 row3=_mm_loadu_ps(&a[t * 4]);
                    __m128 row4=_mm_loadu_ps(&b[i][t]);
                    __m128 row5 = _mm_mul_ps(row3,row4);
                    result = _mm_add_ps(row5, result);
                }
                _mm_storeu_ps(&c[i][j], result);
            }
        }
    }
    printf("******************************************************\n");
    printf ("Done.\n");
    for (i = 0; i < a_r ; i++) {
        for (j = 0; j < b_c; j++) {
            printf ("%f ", c[i][j]); // issue here when I print results.
        }
        printf("\n");
    } // Here program stops working.
    /*free memory*/
    for (i = 0; i < a_r; i++) {
        free(a[i]);
    }
    free(a);
    for (i = 0; i < a_c; i++) {
        free(b[i]);
    }
    free(b);
    for (i = 0; i < b_c; i++) {
        free(c[i]);
    }
    free(c);
}
Please have a look at the addresses printed for the output matrix. How do I get aligned addresses? I have tried _aligned_malloc, but the addresses are still not aligned.
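The allocation for the matrix indirect pointers is incorrect: it reserves sizeof(float) bytes per row pointer, which is too small wherever a pointer is wider than a float (typically 8 bytes versus 4 on 64-bit targets). Storing the row pointers then writes past the end of the block, which is a plausible cause of the crash when the memory is freed. It should read: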
a = (float **)malloc(sizeof(float*) * a_r);
A safer way to write these allocations is this:
a = malloc(sizeof(*a) * a_r);
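To make the sizeof(*a) idiom concrete, here is a minimal sketch of the pointer-per-row allocation written that way. The helper names alloc_matrix and free_matrix are hypothetical and not part of the original program.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate a rows x cols matrix as an array of row pointers. */
float **alloc_matrix(size_t rows, size_t cols) {
    float **m = malloc(sizeof(*m) * rows);     /* sizeof(*m) is sizeof(float *) */
    if (m == NULL) return NULL;
    for (size_t i = 0; i < rows; i++) {
        m[i] = malloc(sizeof(*m[i]) * cols);   /* sizeof(*m[i]) is sizeof(float) */
        if (m[i] == NULL) {
            /* roll back the rows allocated so far */
            while (i > 0) free(m[--i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Hypothetical helper: release a matrix obtained from alloc_matrix. */
void free_matrix(float **m, size_t rows) {
    if (m == NULL) return;
    for (size_t i = 0; i < rows; i++) free(m[i]);
    free(m);
}
```

Because each size is derived from the pointer being assigned, changing the element type later cannot leave a stale sizeof behind.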
Note that you could allocate 2D matrices directly:
float (*a)[4][4] = malloc(sizeof(*a));
Or better, as Cody Gray suggested:
float (*a)[4][4] = _aligned_malloc(sizeof(*a), 16);
_aligned_malloc is a non-standard (Microsoft-specific) function, declared in <malloc.h>, that takes the size and the required alignment and so ensures proper alignment for SSE operands. Memory obtained from it must be released with _aligned_free, not free.
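For toolchains without _aligned_malloc, a similar effect can be sketched with the C11 function aligned_alloc. This assumes a C11 library that provides it (MSVC's runtime does not); the names mat4, alloc_mat4 and is_aligned16 are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

typedef float mat4[4][4];   /* hypothetical type name for a contiguous 4x4 matrix */

/* Hypothetical helper: one contiguous, 16-byte aligned 4x4 matrix.
   sizeof(mat4) is 64 with 4-byte floats, a multiple of the alignment,
   as aligned_alloc requires. Release the result with free(). */
mat4 *alloc_mat4(void) {
    return aligned_alloc(16, sizeof(mat4));
}

/* Returns nonzero if p is suitable for aligned SSE loads and stores. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}
```

Elements are then accessed as (*a)[i][j], and since each row is exactly 16 bytes, every row of the matrix starts on a 16-byte boundary as well.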
In fact, you probably do not even need to allocate these matrices with malloc():
float a[4][4];
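But with this latter choice, you must ensure proper alignment yourself for aligned SSE operations such as _mm_load_ps and _mm_store_ps to succeed. The unaligned variants used in the question (_mm_loadu_ps, _mm_storeu_ps) work at any address.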
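One way to get that alignment for an automatic array is the C11 alignas specifier. The following is a small sketch assuming a C11 compiler; the function name aligned_row_demo is hypothetical.

```c
#include <stdalign.h>
#include <xmmintrin.h>

/* Hypothetical demo: load one row of an aligned stack matrix with _mm_load_ps. */
float aligned_row_demo(void) {
    alignas(16) float a[4][4];   /* base is 16-byte aligned; each row is 16 bytes */
    alignas(16) float out[4];

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            a[i][j] = 2.0f;

    __m128 row = _mm_load_ps(a[1]);            /* aligned load of row 1 */
    _mm_store_ps(out, _mm_add_ps(row, row));   /* aligned store of row + row */

    return out[0] + out[1] + out[2] + out[3];
}
```

Without the alignas specifier the array is only guaranteed the alignment of float, so the aligned load could fault depending on where the compiler places it.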
The rest of the code has other problems:
void main() is incorrect. It should be int main(void).
The second matrix operand should be transposed so you can read multiple values at a time. The second load would become:
__m128 row2 = _mm_loadu_ps(&b[j][k]);
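The summation phase seems incorrect too. And the final store is definitely incorrect: it writes four floats starting at c[i][j], which runs past the end of the row for j > 0. With sum holding the scalar total of the four products, it should just be: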
c[i][j] = sum;
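Putting these points together, a corrected 4x4 multiplication could look like the sketch below. It is one possible arrangement, not necessarily what the answer's author had in mind: it transposes b into a temporary, multiplies one row of a by one row of the transpose, and reduces the four products to a scalar. The names hsum_ps and matmul4x4_sse are hypothetical, and the matrices are assumed to be contiguous float[4][4] arrays rather than arrays of row pointers.

```c
#include <xmmintrin.h>

/* Horizontal sum of the four lanes of v, using SSE1 instructions only. */
static float hsum_ps(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* [v1 v0 v3 v2] */
    __m128 sums = _mm_add_ps(v, shuf);                           /* [v0+v1 . v2+v3 .] */
    shuf = _mm_movehl_ps(shuf, sums);                            /* lane 0 = v2+v3 */
    sums = _mm_add_ss(sums, shuf);                               /* lane 0 = total */
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: c = a * b for contiguous 4x4 matrices. */
void matmul4x4_sse(float a[4][4], float b[4][4], float c[4][4]) {
    float bt[4][4];

    /* Transpose b so that each column of b becomes a contiguous row of bt. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            bt[j][i] = b[i][j];

    for (int i = 0; i < 4; i++) {
        __m128 row = _mm_loadu_ps(a[i]);           /* row i of a */
        for (int j = 0; j < 4; j++) {
            __m128 col = _mm_loadu_ps(bt[j]);      /* column j of b */
            float sum = hsum_ps(_mm_mul_ps(row, col));
            c[i][j] = sum;                         /* scalar store, one element */
        }
    }
}
```

With both inputs filled with 2, as in the question, every element of c comes out as 16. Unaligned loads are used here so the sketch makes no assumption about how the caller allocated the matrices.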