Efficient SSE NxN matrix multiplication

Question

I'm trying to implement an SSE version of large matrix-by-matrix multiplication. I'm looking for an efficient algorithm based on SIMD implementations.

My desired method looks like:

A(n x m) * B(m x k) = C(n x k)

And all matrices are considered to be 16-byte aligned float arrays.

I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it to be as efficient as possible, and I don't want to use the Eigen library or similar libraries (only SSE3, to be more specific).

So I'd appreciate it if anyone can help me find some articles or resources on how to start implementing this.

Answer 1

The main challenge in implementing arbitrary-size matrix-matrix multiplication is not the use of SIMD, but the reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses how to choose kernels that are SIMD-friendly. After reading this paper, expect to achieve 50% of machine peak on matrix-matrix multiplication after two weeks of effort.
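
To make the cache-reuse point concrete, here is a minimal sketch of plain loop tiling combined with a 4-wide SSE inner kernel. It is not the packing scheme from the Goto and Van de Geijn paper, only an illustration of the idea. The function name sgemm_sse_blocked and the tile sizes BLOCK_P and BLOCK_J are hypothetical and would need tuning per CPU. It assumes row-major storage and uses unaligned loads so that any k works; with 16-byte aligned arrays and k a multiple of 4, _mm_load_ps and _mm_store_ps could be used instead.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* Hypothetical tile sizes (assumptions, tune for the target cache sizes).
   BLOCK_J must stay a multiple of 4 so vector chunks never straddle tiles. */
enum { BLOCK_P = 64, BLOCK_J = 256 };

/* Computes C += A * B with A (n x m), B (m x k), C (n x k), all row-major.
   The caller zeroes C beforehand to get a plain product. */
void sgemm_sse_blocked(size_t n, size_t m, size_t k,
                       const float *A, const float *B, float *C)
{
    for (size_t p0 = 0; p0 < m; p0 += BLOCK_P) {
        size_t p1 = (p0 + BLOCK_P < m) ? p0 + BLOCK_P : m;
        for (size_t j0 = 0; j0 < k; j0 += BLOCK_J) {
            size_t j1 = (j0 + BLOCK_J < k) ? j0 + BLOCK_J : k;
            /* The tile B[p0..p1) x [j0..j1) is reused for every row of A,
               so it can stay in cache across the whole i loop. */
            for (size_t i = 0; i < n; ++i) {
                size_t j = j0;
                /* Vector part: four columns of C at a time. */
                for (; j + 4 <= j1; j += 4) {
                    __m128 acc = _mm_loadu_ps(&C[i * k + j]);
                    for (size_t p = p0; p < p1; ++p) {
                        __m128 a = _mm_set1_ps(A[i * m + p]);   /* broadcast A[i][p] */
                        __m128 b = _mm_loadu_ps(&B[p * k + j]); /* B[p][j..j+3] */
                        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
                    }
                    _mm_storeu_ps(&C[i * k + j], acc);
                }
                /* Scalar tail for the last columns when the tile width
                   is not a multiple of 4. */
                for (; j < j1; ++j) {
                    float s = C[i * k + j];
                    for (size_t p = p0; p < p1; ++p)
                        s += A[i * m + p] * B[p * k + j];
                    C[i * k + j] = s;
                }
            }
        }
    }
}
```

A real high-performance kernel would additionally keep several accumulators in registers and pack the tiles of A and B into contiguous buffers, which is what the paper describes in detail.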

However, if the purpose of this work is not pure learning, I strongly recommend using a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture that is not supported by the other libraries.

Answer 1 (9, accepted, 2014-02-02 09:14:25)