In CUDA, loading to shared memory is slower than loading to registers

Question

I'm not an experienced CUDA programmer, and I ran into the following problem. I'm trying to load a 32x32 tile of a large matrix (10K x 10K) from global memory into shared memory, and I'm timing the load. I noticed that if I load the tile into private memory (registers) instead, the load is 4-5 times faster than the load into shared memory.

// tile and pad are compile-time constants defined elsewhere
// (tile is 32 in the full listing below; the value of pad is not shown).
__global__ void speedtest(float *vel, int nx) {

    int globalx = blockDim.x * blockIdx.x + threadIdx.x + pad;
    int globalz = blockDim.y * blockIdx.y + threadIdx.y + pad;
    int localx = threadIdx.x;
    int localz = threadIdx.y;

    float ptest;
    __shared__ float stest[tile][tile];

    //stest[localz][localx] = vel[globalz*nx + globalx]; // load to shared memory
    ptest = vel[globalz*nx + globalx];                   // load to private memory
    __syncthreads();
}

I comment out stest and ptest one at a time and measure the elapsed time with cudaEventRecord. The stest version took 3.2 ms and the ptest version took 0.75 ms. What am I doing wrong? The timings should be very similar, right? What am I missing?

Configuration: CUDA 7.5, GTX 980, only 32-bit variables and calculations. There is no specific purpose, I'm just experimenting.

I'm posting sample code as requested:

#include <stdio.h>
#include <math.h>
#include <helper_cuda.h>   // checkCudaErrors (from the CUDA samples)

#define tile 32

// Fill the nz*nx vector with a constant value.
void makeittwo(float *array, int nz, int nx)
{
    int n2 = nx * nz;
    for (int i = 0; i < n2; i++)
        array[i] = 2000;
}

__global__ void speedtest(float *vel, int nx, int nz) {

    int globalx = blockDim.x * blockIdx.x + threadIdx.x;
    int globalz = blockDim.y * blockIdx.y + threadIdx.y;
    int localx = threadIdx.x;
    int localz = threadIdx.y;

    float ptest; // declarations
    __shared__ float stest[tile][tile];

    if (globalx < nx && globalz < nz) {
        stest[localz][localx] = vel[globalz*nx + globalx]; // shared variable
        //ptest = vel[globalz*nx + globalx];               // private variable

        // comment out ptest and stest one by one to test them
    }
    __syncthreads();
}

int main(int argc, char **argv)
{
    int nx, nz, N;
    float *vel;

    nz = 10000; nx = 10000; // matrix dimensions
    N = nz * nx;            // treat the matrix as a vector

    checkCudaErrors(cudaMallocHost(&vel, sizeof(float) * N)); // pinned host memory
    makeittwo(vel, nz, nx);

    dim3 dimBlock(tile, tile);
    dim3 dimGrid;

    int blockx = dimBlock.x;
    int blockz = dimBlock.y;

    // round up so the grid covers the whole matrix
    dimGrid.x = (nx + blockx - 1) / blockx;
    dimGrid.y = (nz + blockz - 1) / blockz;

    float *d_vel;
    checkCudaErrors(cudaMalloc(&d_vel, sizeof(float) * N));
    checkCudaErrors(cudaMemcpy(d_vel, vel, sizeof(float) * N, cudaMemcpyHostToDevice)); // copy to device

    cudaEvent_t start, stop;
    float elapsedTime;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    speedtest<<<dimGrid, dimBlock>>>(d_vel, nx, nz); // launch the kernel
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);

    printf("time=%3.3f ms\n", elapsedTime);

    // copy the matrix back to check that all went well
    // (this fails if out-of-bounds accesses were made)
    checkCudaErrors(cudaMemcpy(vel, d_vel, sizeof(float) * N, cudaMemcpyDeviceToHost));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    checkCudaErrors(cudaFree(d_vel));
    checkCudaErrors(cudaFreeHost(vel));
    cudaDeviceReset();

    return 0;
}

Answer 1

The example code does not actually measure what the OP expects it to measure, because some instructions are optimized away by the compiler.

In the local variable example (ptest) the load does not affect any state outside the kernel, so the compiler is free to remove the instruction completely. This can be seen in the SASS code: it is identical whether ptest=vel[globalz*nx+globalx]; is active or both statements (ptest and stest) are removed. To inspect the SASS code, run cuobjdump --dump-sass on the object file.

Apparently, the instructions are not optimized away in the shared memory example, as can be checked in the SASS code. (Actually, I would have expected these instructions to be removed as well. Are there side effects that I am missing?)

As already discussed in the comments, once a simple calculation (ptest*=ptest) and a write to global memory are added, the compiler can no longer remove the load, because the result changes global state.
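A minimal sketch of such a test, assuming the setup from the question (32x32 blocks, a row-major nx by nz matrix). The kernel names load_private and load_shared, the separate output array out, and the host helper time_kernel are hypothetical names introduced here for illustration; error checking is left out to keep it short. Both kernels square the loaded value and write it to global memory, so neither load can be dropped as dead code.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 32

// Load into a register, square, write the result to global memory.
__global__ void load_private(const float *vel, float *out, int nx, int nz) {
    int gx = blockDim.x * blockIdx.x + threadIdx.x;
    int gz = blockDim.y * blockIdx.y + threadIdx.y;
    if (gx < nx && gz < nz) {
        float ptest = vel[gz * nx + gx];
        ptest *= ptest;
        out[gz * nx + gx] = ptest;  // global side effect keeps the load alive
    }
}

// Same work, but written in terms of a shared-memory tile.
__global__ void load_shared(const float *vel, float *out, int nx, int nz) {
    __shared__ float stest[TILE][TILE];
    int gx = blockDim.x * blockIdx.x + threadIdx.x;
    int gz = blockDim.y * blockIdx.y + threadIdx.y;
    int lx = threadIdx.x;
    int lz = threadIdx.y;
    if (gx < nx && gz < nz) {
        stest[lz][lx] = vel[gz * nx + gx];
        stest[lz][lx] *= stest[lz][lx];
        out[gz * nx + gx] = stest[lz][lx];
    }
}

typedef void (*kernel_t)(const float *, float *, int, int);

// Hypothetical helper: fills an nx*nz input with `fill`, runs `kernel` once
// as a warm-up and once timed, stores out[0] in *first_out, and returns the
// elapsed time of the timed launch in milliseconds.
float time_kernel(kernel_t kernel, int nx, int nz, float fill, float *first_out) {
    size_t n = (size_t)nx * nz;
    std::vector<float> h(n, fill);
    float *d_vel = nullptr, *d_out = nullptr;
    cudaMalloc(&d_vel, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_vel, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((nx + TILE - 1) / TILE, (nz + TILE - 1) / TILE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<grid, block>>>(d_vel, d_out, nx, nz);  // warm-up launch
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>(d_vel, d_out, nx, nz);  // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(first_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_vel);
    cudaFree(d_out);
    return ms;
}
```

Calling time_kernel(load_private, 10000, 10000, 2000.0f, &v) and then the same with load_shared gives two timings that can be compared directly. Compiling the file with nvcc -cubin and running cuobjdump --dump-sass on the result shows which loads and stores each kernel really contains.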

From the comments of the OP I assume that there is a misunderstanding about how a load to shared memory works. The data is actually loaded from global memory into a register and then stored from that register into shared memory. The relevant SASS instructions generated for sm_30 look like this:

LD.E R2, [R6]; // load to register R2
STS [R0], R2; // store from register R2 to shared memory

The following multiply-and-store-to-global-memory example demonstrates another case where the compiler does not produce the code one might naively expect:

stest[localz][localx]=vel[globalz*nx+globalx]; // load to shared memory
stest[localz][localx]*=stest[localz][localx]; // multiply
vel[globalz*nx+globalx]=stest[localz][localx]; // save to global memory

The SASS code shows that the value is stored in shared memory only after the computation, and is never read back from shared memory:

LD.E R2, [R6]; // load to register
FMUL R0, R2, R2; // multiply
STS [R3], R0; // store the result in shared memory
ST.E [R6], R0; // store the result in global memory
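In the snippet above each thread only reads back the element it wrote itself, which is why the value can stay in a register. For contrast, here is a sketch in which every thread reads an element written by a different thread after __syncthreads(), so the data has to travel through shared memory. The kernel name tile_transpose, the arrays in and out, and the host helper run_transpose are hypothetical; the sketch assumes a square n by n matrix with n a multiple of 32, so that every thread of every block reaches the barrier.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 32

// Transpose an n x n row-major matrix one 32x32 tile at a time.
// Each thread writes one element of the tile and, after the barrier,
// reads a different element that another thread wrote.
__global__ void tile_transpose(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();  // the whole tile must be written before anyone reads it

    // Position of this thread inside the transposed block.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical helper: transposes an n x n matrix filled with 0, 1, 2, ...
// on the device and checks the result on the host.
bool run_transpose(int n) {
    size_t count = (size_t)n * n;
    std::vector<float> h_in(count), h_out(count);
    for (size_t i = 0; i < count; i++)
        h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, count * sizeof(float));
    cudaMalloc(&d_out, count * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    tile_transpose<<<grid, block>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, count * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);

    // out[c][r] must equal in[r][c] for every element.
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            if (h_out[(size_t)c * n + r] != h_in[(size_t)r * n + c])
                return false;
    return true;
}
```

For a kernel like this, the SASS should contain a shared-memory load in addition to the STS. That is an expectation based on the data flow, not a dump from a particular compiler version, so it is worth confirming with cuobjdump --dump-sass.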

I am not really an expert in SASS code, so please correct me if I am wrong or have left out anything important.

Answer 1: score 5, accepted, 2015-08-13 14:46:21
