
CUDA GPU time in MATLAB increasing when the grid size is increased

I am using MATLAB R2017a. I am running a simple kernel to compute the cumulative sum from the first element up to the i-th element.

My CUDA kernel code is:

__global__ void summ(const double *A, double *B, int N){
    for (int i=threadIdx.x; i<N; i++){
        B[i+1] = B[i] + A[i];
    }
}

My MATLAB code is:

k=parallel.gpu.CUDAKernel('summ.ptx','summ.cu');

n=10^7;
A=rand(n,1);
ans=zeros(n,1);
A1=gpuArray(A);
ans2=gpuArray(ans);

k.ThreadBlockSize = [1024,1,1];
k.GridSize = [3,1];
G = feval(k,A1,ans2,n);
G1 = gather(G);
GPU_time = toc

I am wondering why the GPU time increases when I increase the grid size (k.GridSize). For instance, for 10^7 elements:

k.GridSize=[1,1] the time is 8.0748s
k.GridSize=[2,1] the time is 8.0792s
k.GridSize=[3,1] the time is 8.0928s

From what I understand, 10^7 elements with 1024 threads per block need 10^7 / 1024 ≈ 9766 blocks (rounding up), so the grid size should be [9766,1].

The GPU device is:
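The block-count arithmetic above can be sketched in MATLAB as follows. This is only an illustration of the rounding: 10^7 / 1024 is 9765.625, which rounds up to 9766. The variable names `threadsPerBlock`, `numBlocks`, and `gridSize` are made up for this example and do not appear in the original code.

```matlab
% Sketch: derive the grid size from the problem size instead of hard-coding it.
n = 10^7;                                % number of elements, as in the question
threadsPerBlock = 1024;                  % same value as k.ThreadBlockSize(1)
numBlocks = ceil(n / threadsPerBlock);   % 9765.625 rounds up to 9766
gridSize = [numBlocks, 1];               % the value one would assign to k.GridSize
fprintf('blocks needed: %d\n', numBlocks);
```

A grid sized this way only helps if the kernel computes a global index from blockIdx.x, blockDim.x, and threadIdx.x. The kernel shown earlier uses threadIdx.x alone, and a cumulative sum also carries a dependency from each element to the next, so a larger grid by itself would not make that kernel faster or correct.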

                  Name: 'Tesla K20c'
                 Index: 1
     ComputeCapability: '3.5'
        SupportsDouble: 1
         DriverVersion: 9.1000
        ToolkitVersion: 8
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [2.1475e+09 65535 65535]
             SIMDWidth: 32
           TotalMemory: 5.2983e+09
       AvailableMemory: 4.9132e+09
   MultiprocessorCount: 13
          ClockRateKHz: 705500
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

Thank you for your response.

You appear to be worrying about a very small fraction of the total time. The real question you should be asking is: does this amount of time to solve this problem make sense? The answer is no, absolutely not. Your kernel indexes only by threadIdx.x, so every thread loops serially up to N and additional blocks just repeat the same work.

Here is modified code that should run much faster:

n=10^7;
dev = gpuDevice;

A = randn(n,1,'gpuArray');
B = randn(n,1,'gpuArray');
tic
G = A+cumsum(B);
wait(dev)
toc

On my 1060 this runs in 0.03 seconds. For even faster speeds you can use single precision.

At any rate, that 0.02-second difference could easily be attributable to small changes in the load on your GPU. That is a much more likely explanation than the grid size.
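To see how much of a timing difference is just noise, and to try the single-precision suggestion, something like the following could be used. It is a minimal sketch, assuming a supported GPU and the Parallel Computing Toolbox. The variable names are illustrative, and `gputimeit` is used here in place of tic/toc because it repeats the measurement and waits for the GPU to finish.

```matlab
% Sketch: time the built-in GPU cumsum in double and single precision.
n = 10^7;

Ad = randn(n, 1, 'gpuArray');            % double precision (the default)
Bd = randn(n, 1, 'gpuArray');
As = single(Ad);                         % same data converted to single precision
Bs = single(Bd);

% gputimeit takes a function handle with no arguments and returns seconds.
tDouble = gputimeit(@() Ad + cumsum(Bd));
tSingle = gputimeit(@() As + cumsum(Bs));

fprintf('double: %.4f s, single: %.4f s\n', tDouble, tSingle);
```

Running this a few times gives a feel for the run-to-run spread on a given machine, which is the scale against which a difference of a few hundredths of a second should be judged.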
