Faster Matrix Multiplication in CUDA
I am currently writing a neural network program in CUDA C. Because I need to manipulate the matrix multiplication, I did not use cuBLAS for the MM; I use the following code instead. I was wondering if anyone has advice on making it faster, which would be very helpful since I need to run MM millions of times during learning. Thanks.
This is the Makefile:
# cuda root
_CUDA_ROOT_ = /usr/local/cuda
NVCC = nvcc
# include and lib paths
INCLUDES=-I${_CUDA_ROOT_}/include
LIB_PATH=-L${_CUDA_ROOT_}/lib64
# libraries to link against
LIB= -lcudart -lcublas
CU_SRC= main.cu
EXE=$(CU_SRC:.cu=)
#------------------------------
# Choose your gpu arch
SM = sm_35
all: $(EXE)
$(EXE): $(CU_SRC)
	$(NVCC) -arch $(SM) $(CU_SRC) -o $(EXE) $(LIB_PATH) $(LIB)
clean:
	rm -f *.o *.cu_o $(EXE)
This is the MM code:
__global__
void matrixMulti(float* A_d, float* B_d, float* C_d, int m, int k, int n)
{
    // Tiled multiplication of row-major A (m x n) by B (n x k) into C (m x k).
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    float sum = 0;
    // Walk the shared dimension n one TILE_WIDTH-wide tile at a time.
    for (int t = 0; t < (n-1)/TILE_WIDTH+1; t++)
    {
        // Stage one tile of A and one tile of B in shared memory,
        // zero-padding at the matrix edges.
        if (row < m && t*TILE_WIDTH+tx < n)
            ds_A[ty][tx] = A_d[row*n + t*TILE_WIDTH+tx];
        else
            ds_A[ty][tx] = 0.0;
        if (t*TILE_WIDTH+ty < n && col < k)
            ds_B[ty][tx] = B_d[(t*TILE_WIDTH+ty)*k + col];
        else
            ds_B[ty][tx] = 0.0;
        __syncthreads();
        // Accumulate this tile's contribution to the dot product.
        for (int i = 0; i < TILE_WIDTH; i++)
            sum += ds_A[ty][i] * ds_B[i][tx];
        __syncthreads();
    }
    if (row < m && col < k)
        C_d[col + row*k] = sum;
}
This is an example of the main part of the code:
#include <cuda_runtime.h>

const int TILE_WIDTH = 32;

int main()
{
    int m, k, n;
    m = 10000, k = 10000, n = 10000;
    float *A, *B, *C;
    A = new float[m*n];
    B = new float[n*k];
    C = new float[m*k];
    float *A_d, *B_d, *C_d;
    for (int i = 0; i < m*n; i++)
    {
        A[i] = 2;
    }
    for (int i = 0; i < n*k; i++)
    {
        B[i] = 3;
    }
    cudaMalloc(&A_d, sizeof(float)*m*n);
    cudaMalloc(&B_d, sizeof(float)*n*k);
    cudaMalloc(&C_d, sizeof(float)*m*k);
    cudaMemcpy(A_d, A, sizeof(float)*m*n, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, sizeof(float)*k*n, cudaMemcpyHostToDevice);
    // One thread per element of C, in TILE_WIDTH x TILE_WIDTH blocks.
    dim3 dimGrid((k-1)/TILE_WIDTH+1, (m-1)/TILE_WIDTH+1, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    matrixMulti<<<dimGrid,dimBlock>>>(A_d, B_d, C_d, m, k, n);
    cudaMemcpy(C, C_d, sizeof(float)*m*k, cudaMemcpyDeviceToHost);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    delete[] A; delete[] B; delete[] C;
    return 0;
}
Firstly, be really sure this is what you want to do. Without a description of the manipulations you want to perform it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not of the same complexity, chances are you'll do better simply using cuBLAS.
Why is this? cuBLAS will probably be faster than anything you'll write, and it will be much more maintainable, as it will follow new GPU architectures. The best implementation of something like GEMM varies by architecture, so any code you write now for your current hardware will have to be re-optimised for new hardware.
Now, to the question. There are a number of techniques you should consider to optimise this code: for example, computing several output elements per thread, so that each value staged in shared memory is reused more than once; double-buffering the shared-memory tiles, so that global loads overlap with computation; and using wider, vectorized global loads. A sketch of the first of these follows.
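As an illustration only (my sketch, untested on your hardware and certainly not tuned; the kernel name and the WPT/RTS constants are mine): each thread below computes WPT elements of a tile column, so every value read from ds_B is reused WPT times from registers, cutting shared-memory traffic per FLOP. The data layout and bounds handling match your kernel above.
const int TILE = 32;          // tile width, as TILE_WIDTH above
const int WPT  = 4;           // outputs computed per thread (tuning parameter)
const int RTS  = TILE / WPT;  // thread rows per block (here 8)

__global__
void matrixMultiWPT(const float* A_d, const float* B_d, float* C_d,
                    int m, int k, int n)
{
    __shared__ float ds_A[TILE][TILE];
    __shared__ float ds_B[TILE][TILE];
    int tx = threadIdx.x;                  // 0..TILE-1
    int ty = threadIdx.y;                  // 0..RTS-1
    int col = blockIdx.x*TILE + tx;
    int rowBase = blockIdx.y*TILE;
    float acc[WPT];
    for (int w = 0; w < WPT; w++)
        acc[w] = 0.0f;
    for (int t = 0; t < (n-1)/TILE+1; t++)
    {
        // Each of the TILE*RTS threads stages WPT elements of each tile,
        // zero-padding at the matrix edges as before.
        for (int w = 0; w < WPT; w++)
        {
            int r = ty + w*RTS;            // row within the tile
            int aRow = rowBase + r, aCol = t*TILE + tx;
            ds_A[r][tx] = (aRow < m && aCol < n) ? A_d[aRow*n + aCol] : 0.0f;
            int bRow = t*TILE + r;
            ds_B[r][tx] = (bRow < n && col < k) ? B_d[bRow*k + col] : 0.0f;
        }
        __syncthreads();
        for (int i = 0; i < TILE; i++)
        {
            float b = ds_B[i][tx];         // one shared load, reused WPT times
            for (int w = 0; w < WPT; w++)
                acc[w] += ds_A[ty + w*RTS][i] * b;
        }
        __syncthreads();
    }
    for (int w = 0; w < WPT; w++)
    {
        int row = rowBase + ty + w*RTS;
        if (row < m && col < k)
            C_d[row*k + col] = acc[w];
    }
}
The launch then uses dim3 dimBlock(TILE, RTS, 1); with the same dimGrid as before. Whether WPT = 4 is the right choice depends on your GPU; treat it as a tuning knob, and note that tuned kernels push this much further (register tiles in both dimensions, double buffering, vector loads).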
There are a number of papers on the implementation of matrix multiplication on GPUs; I suggest you check them out. You'll get far more detail from those papers than from asking broad questions on SO.
Finally... are you sure you don't want to use cuBLAS? I wouldn't count on getting 75% of cuBLAS's performance, and even that will be a challenge.
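For reference, here is roughly what the cuBLAS path looks like for the row-major layout used above (a sketch; the helper name is mine, and your Makefile already links -lcublas). cuBLAS is column-major, so we ask it for C^T = B^T * A^T, which in memory is exactly the row-major C = A * B:
#include <cublas_v2.h>

// Sketch: row-major C (m x k) = A (m x n) * B (n x k) via cublasSgemm.
void matrixMultiCublas(cublasHandle_t handle,
                       const float* A_d, const float* B_d, float* C_d,
                       int m, int k, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major view: C^T (k x m) = B^T (k x n) * A^T (n x m).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                k, m, n,
                &alpha,
                B_d, k,   // leading dimension of B^T
                A_d, n,   // leading dimension of A^T
                &beta,
                C_d, k);  // leading dimension of C^T
}
Create the handle once with cublasCreate, reuse it across your millions of calls, and free it with cublasDestroy at the end; creating a handle per call would dominate the runtime for small matrices.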