
Doing multiple matrix-matrix multiplications in one operation

I'm implementing an algorithm that, in essence, is a series of matrix-matrix multiplications like this:

Res = M1.M2.M3. ... .Mn

My matrices are really small (100x100 floats), but the sequence is really long, on the order of billions.

I tried using CUBLAS to do the matrix multiplications, but this was slow. I did, however, notice something interesting.

Multiplying a 100x100 matrix by a 100x100 matrix was slow, but multiplying a 1,000,000x100 matrix by a 100x100 matrix was relatively fast. That made me think: what if, instead of one scan from left to right, I ran 10,000 scans in parallel? That should be pretty fast, and if I then multiplied the partial results together when done, I would get the same result -- just faster.

Res1    = M1 . M2 . ... . M(n/1000)
Res2    = M(n/1000 + 1) . M(n/1000 + 2) . ... . M(2n/1000)
...
Res1000 = M(999n/1000 + 1) . ... . Mn
Res = Res1 . Res2 . ... . Res1000

It's worth noting that M_1 ... M_n are drawn from a set of about 100 different matrices, so space consumption isn't really a problem; all I need is to be able to do multiple multiplies in one operation.

Now here is my problem. I've done a matrix-matrix (SGEMM) implementation inspired by the one NVIDIA demonstrates in their documentation, but it is about 4 times slower than CUBLAS. Does anyone know how CUBLAS works? And is the code available somewhere?

Have you looked at the latest CUBLAS (version 4.1)? It includes a new batched GEMM mode specifically intended for large batches of small matrix-matrix multiplies. I would suggest doing a pairwise multiplication tree as Jonathan Dursi suggested in his answer, using the CUBLAS batched API to accelerate it, rather than writing your own custom kernel as he suggests.

CUBLAS 4.1 is included with the CUDA Toolkit v4.1.

CUBLAS batched GEMM API improves performance for batches of small matrices
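
As an illustration only (this sketch is mine, not part of the answer), one level of the pairwise tree could be mapped onto the batched API roughly as below. The function name, the fixed 100x100 size, and the assumption that the pointer arrays already live in device memory are all illustrative:

#include <cublas_v2.h>

/* Multiply batchCount independent 100x100 pairs in one call: C[i] = A[i] * B[i].
 * d_Aarray/d_Barray/d_Carray are device arrays of device pointers, each pointing
 * to a column-major 100x100 matrix. */
static void multiply_pairs(cublasHandle_t handle,
                           const float *const *d_Aarray,
                           const float *const *d_Barray,
                           float *const *d_Carray,
                           int batchCount)
{
  const float alpha = 1.0f, beta = 0.0f;
  const int n = 100;
  cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     n, n, n,
                     &alpha, d_Aarray, n, d_Barray, n,
                     &beta, d_Carray, n,
                     batchCount);
}

Calling something like this repeatedly, halving the list of matrices each time, gives the log2(N) pairwise reduction described below without writing a custom kernel.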

The problem is that CUBLAS etc. are designed for using all of the SMs to multiply large matrices. That's not what you want; you want to do lots of little matrix multiplications.

There may be some way to cast this into something CUBLAS could do well for you, but I'm not seeing it. My suggestion would be the following:

Write a kernel that uses one thread block to multiply two of your small matrices, and output the result.

Then launch the kernel log2(N) times with tonnes of blocks and tackle the multiplication pairwise:

  • Step 1: multiply M1 x M2, M3 x M4 ... M(N-1) x M(N), outputting M'1, M'2 ... M'(N/2)
  • Step 2: multiply M'1 x M'2, M'3 x M'4 ... M'(N/2-1) x M'(N/2), outputting M''1, M''2 ... M''(N/4)

etc.

There'll be about 50% memory overhead, but I think you'll make better use of your cores this way.
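
To make that concrete, here is a minimal sketch (my illustration, not code from the answer) of such a per-pair kernel and the host-side reduction loop. The fixed size S, the back-to-back buffer layout, the 256-thread launch, and the assumption that the list length is a power of two are simplifications; grid-size limits, the odd-leftover case, and shared-memory tiling are ignored.

#include <stddef.h>

#define S 100   /* matrix dimension */

/* One thread block computes one product: Out[p] = In[2p] * In[2p+1].
 * Matrices are stored back-to-back, row-major, S*S floats each. */
__global__ void pair_mult(const float *In, float *Out, int npairs)
{
    int p = blockIdx.x;
    if (p >= npairs) return;
    const float *a = In  + (size_t)(2 * p)     * S * S;
    const float *b = In  + (size_t)(2 * p + 1) * S * S;
    float       *c = Out + (size_t)p           * S * S;

    /* each thread strides over output elements; purely a sketch, no tiling */
    for (int idx = threadIdx.x; idx < S * S; idx += blockDim.x) {
        int i = idx / S, j = idx % S;
        float acc = 0.0f;
        for (int k = 0; k < S; ++k)
            acc += a[i * S + k] * b[k * S + j];
        c[idx] = acc;
    }
}

/* Pairwise reduction: log2(n) launches, halving the list each time.
 * Assumes n is a power of two; an odd leftover matrix would just be
 * carried over to the next level unchanged (omitted here). */
float *reduce_chain(float *d_buf0, float *d_buf1, int n)
{
    while (n > 1) {
        int npairs = n / 2;
        pair_mult<<<npairs, 256>>>(d_buf0, d_buf1, npairs);
        float *t = d_buf0; d_buf0 = d_buf1; d_buf1 = t;   /* ping-pong buffers */
        n = npairs;
    }
    return d_buf0;   /* points at the single remaining S*S product */
}

Because matrix multiplication is associative but not commutative, pairing neighbours and keeping the outputs in order preserves the result of the original left-to-right scan.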

Update

Ok, if you really don't want to do this in stages, you could do it this way, but it'll require more coding, and performance will probably be worse compared to what you could get with something like cuBLAS and asynchronous transfers. I'm assuming you're using a Fermi, and you've turned off the L1 cache so you have 48K of shared memory per SM.

Store the 100 matrices in 2x2 block form, with each block contiguous in memory. So matrix[matrixnum,i,j] starts at matricies[matrixnum*100*100 + i*100*50 + j*50*50]. Note that each block is 50*50*4 bytes ~ 10K, so 4 comfortably fit in shared memory.

Assign each group of 4 threadblocks an (Nmatricies/Nblocks)-long chain of the matrices to multiply, with one of the four being responsible for each block of the multiplication.

Let's say you're threadblock 1 of 4 and the first of the matricies you're to multiply is AxB. You're responsible for block (1,1) of the result: (AB)(1,1) = A(1,1)*B(1,1) + A(1,2)*B(2,1). You pre-load A(1,1) into myblock[0] in shared memory.

  • load myblock[1] = B(1,1) from global memory
  • myblock[3] = myblock[0] * myblock[1] (matrix mult, all in shared memory)
  • load myblock[1] = A(1,2) from global memory
  • load myblock[2] = B(2,1) from global memory
  • myblock[0] = myblock[3] + (myblock[1] * myblock[2]) (matrix mult and addition, all in shared memory)

Now you can repeat this for the rest of the sequence of matrices in your part of the chain, outputting only when done.

When you're done, you'll end up with (#SMs) matrices in global memory, which still have to be multiplied together, but there won't be any additional temporary storage in global memory, and you won't have had to copy data into global memory other than the original matrices and the lists of which ones each group should tackle.
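
For concreteness, here is a minimal sketch of the two primitives the step list above relies on: the packed block indexing and a 50x50 multiply / multiply-accumulate done entirely in shared memory by one thread block. This is my own illustration under the stated assumptions (names like BS, blk_offset and block_gemm are made up); the inter-block coordination for the running product is left out.

#include <stddef.h>

#define BS 50   /* each 100x100 matrix is stored as a 2x2 grid of 50x50 blocks */

/* Offset of block (i,j) of matrix 'matrixnum' in the packed layout above. */
__host__ __device__ size_t blk_offset(int matrixnum, int i, int j)
{
    return (size_t)matrixnum * 100 * 100 + (size_t)i * 100 * 50 + (size_t)j * 50 * 50;
}

/* C = (accumulate ? C : 0) + A * B for 50x50 blocks resident in shared memory.
 * All threads of the block cooperate, each striding over output elements.
 * C must not alias A or B (it never does in the step list above). */
__device__ void block_gemm(float *C, const float *A, const float *B, bool accumulate)
{
    for (int idx = threadIdx.x; idx < BS * BS; idx += blockDim.x) {
        int r = idx / BS, c = idx % BS;
        float acc = 0.0f;
        for (int k = 0; k < BS; ++k)
            acc += A[r * BS + k] * B[k * BS + c];
        C[idx] = (accumulate ? C[idx] : 0.0f) + acc;
    }
    __syncthreads();   /* make results visible to the whole block before the next step */
}

With myblock[] carved out of one 4*BS*BS float shared-memory array (about 39 KB, which fits in the 48 KB per SM), steps 2 and 5 of the list become block_gemm(myblock[3], myblock[0], myblock[1], false) and block_gemm(myblock[3], myblock[1], myblock[2], true) (accumulating into myblock[3] rather than copying back into myblock[0]), with the global-memory loads in between done cooperatively by all threads of the block and followed by a __syncthreads().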

Again, there's no real reason to do this except that you can't be bothered to ship data to the GPU in stages, and performance will almost certainly be worse; there are fewer global memory writes, but you'll probably pay for that with a hand-rolled GEMM. The good news is that 50 isn't a multiple of 8, so you probably won't have too much in the way of shared memory bank conflicts.

Again, for bonus points, you can precompute all the blocks of all pairwise matrix products first, and then halve the length of your list.

LIBXSMM - a library targeting Intel Architecture for small, dense or sparse matrix multiplications and small convolutions - is exactly meant to exploit best performance for small matrix multiplications.

In contrast to NVidia CUBLAS (or Intel MKL), LIBXSMM does not rely on a batch interface. Instead, one can arrange for individual calls and also supply "next locations", i.e., where the operands/matrices of the next multiplication are located (in memory). The advantage is that an explicit data structure or index format describing the batch is not needed.

#include <libxsmm.h>

int main()
{
  const libxsmm_gemm_prefetch_type prefetch = LIBXSMM_PREFETCH_AUTO;
  const double alpha = 1.0, beta = 1.0; /* accumulate C-matrix */
  const int m = 23, n = 23, k = 23;     /* some problem size */
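  /* Illustrative batch setup, not part of the original sample: matrices are
     assumed to be stored back-to-back, and the C-matrix is accumulated. */
  const int nbatch = 1000;
  const int asize = m * k, bsize = k * n, csize = m * n;
  static double a[1000 * 23 * 23], b[1000 * 23 * 23], c[23 * 23];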
  libxsmm_dmmfunction xmm = NULL;       /* function pointer */

  xmm = libxsmm_dmmdispatch(23, 23, 23, /* some problem size */
          NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
          &alpha, &beta, NULL/*flags*/,
          NULL/*&prefetch*/);

  if (xmm) { /* JiT'ted code has been generated */
#   pragma omp parallel for
    for (int i = 0; i < nbatch; ++i) {
      const double *const ai = a + i * asize;
      const double *const bi = b + i * bsize;
      /* e.g., matrix C is accumulated (instead of streamed) */
      double *const ci = c /*+ i * csize*/;
      /* optionally provide "next locations" */
      xmm(ai, bi, ci/*,
          ai + 1 * asize,
          bi + 1 * bsize,
          ci + 0 * csize
      */);
    }
  }
}

LIBXSMM produces highly optimized and specialized code (JIT), which exploits the latest instruction set extensions (SSE3, AVX, AVX2, and AVX-512). LIBXSMM is available under a permissive license (BSD-3-Clause).

NOTE: This is not about CUBLAS (as originally asked for).
