
How to get the "sum" of parallel arrays in CUDA?

My problem is about computing the "sum" of several equal-length arrays. For example, I have an M*N (100 * 2000) float array in total, and I would like to get M (100) sum values, each over N (2000) floats. I found two ways to do this. One is to call a cuBLAS function such as cublasSasum in a for loop over M. The other is a self-written kernel function that adds the numbers in a loop. My question is about the speed of these two ways and how to choose between them.

For the cuBLAS method, no matter how big N is (4000~2E6), the time consumed depends mainly on M, the number of loop iterations.

For the self-written kernel function, the speed varies a lot with N. In detail, if N is small (below 5000) it runs much faster than the cuBLAS way, and then the time consumption increases as N grows.

N = 4000   | 10000  | 40000   | 80000   | 1E6     | 2E6
t = 254 ms | 422 ms | 1365 ms | 4361 ms | 5399 ms | 10635 ms

If N is big enough, it runs much slower than the cuBLAS way. My question is how I could make a prediction from M or N to decide which way I should use. My code may run on different GPU devices. Must I compare the speeds in a parameter sweep and then "guess" to make the choice on every GPU device, or can I infer it from the GPU device information?

Besides, for the kernel-function way, I also have a problem deciding the blockSize and gridSize. I must note here that what I care about more is speed, not efficiency, because the memory is limited. For example, if I have 8 GB of memory and my data format is 4-byte float, then with N = 1E5, M is at most 2E4, which is smaller than MaxGridSize. So I have the two launch configurations shown below. I found that a bigger gridSize is always faster, and I don't know the reason. Is it about the number of registers used per thread? But I don't think this kernel function needs many registers per thread.

Any suggestion or information would be appreciated. Thank you.

cuBLAS way

for (int j = 0; j < M; j++)
    cublasStatus = cublasSasum(cublasHandle, N, d_in + N * j, 1, d_out + j);

self-written kernel way

__global__ void getSum(int M, int N, float* in, float* out)
{
    // One thread per row: thread i sums the N elements of row i.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < M) {
        float tmp = 0;
        for (int ii = 0; ii < N; ii++) {
            tmp += in[N * i + ii];  // at a fixed ii, adjacent threads read addresses N apart
        }
        out[i] = tmp;
    }
}

A bigger gridSize is faster, and I don't know the reason.

getSum<<<M,1>>>(M, N, d_in, d_out); // faster
getSum<<<1,M>>>(M, N, d_in, d_out); // slower

This is a blockSize-time parameter sweep result, with M = 1E4 and N = 1E5.

cudaEventRecord(start, 0);
// blockSize swept from 1 to 1024
int gridSize = (M + blockSize - 1) / blockSize;
getSum<<<gridSize, blockSize>>>(M, N, d_in, d_out);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

It seems I should choose a relatively small blockSize, like 10~200. I would just like to know why full occupancy (blockSize 1024) is slower. I post some possible reasons here: register count? Latency?

[plot: blockSize vs. elapsed time]

Using cuBLAS is generally a very good idea and should be preferred when there is a dedicated function doing what you want directly, especially for large datasets. That being said, your timings are very bad for a GPU kernel working on such a small dataset. Let us understand why.

Bigger gridSize is faster. I don't know the reason.
getSum<<<M,1>>>(M, N, d_in, d_out);
getSum<<<1,M>>>(M, N, d_in, d_out);

The syntax for calling a CUDA kernel is kernel<<<numBlocks, threadsPerBlock>>> . Thus the first line submits a kernel with M blocks of 1 thread each. Don't do that: this is very inefficient. Indeed, the CUDA programming manual says:

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor , and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. [...]
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps . [...]
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch , the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

As a result, the first call creates M blocks of 1 thread, wasting 31 of the 32 CUDA cores available in each warp. It means that you will likely reach only 3% of the peak performance of your GPU...

The second call creates one block of M threads. Because M is not a multiple of 32, a few CUDA cores are wasted. Moreover, it uses only 1 SM out of the many available on your GPU, because you have only one block. Modern GPUs have dozens of SMs (my GTX-1660S has 22 SMs). This means that you will use only a tiny fraction of your GPU's capability (a few %). Not to mention that the memory access pattern is not contiguous, slowing the computation down even more...

If you want to use your GPU much more efficiently, you need to provide more parallelism and waste fewer resources. You can start by writing a kernel working on a 2D grid that performs a reduction using atomics, as in the sketch below. This is not perfect, but much better than your initial code. You should also take care to read memory contiguously (threads sharing the same warp should read/write a contiguous block of memory).
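Here is a minimal sketch of what I mean (the kernel name and launch parameters are illustrative, not code from your post): blockIdx.y selects the row, threads of a warp read consecutive elements of that row (coalesced), a block-level reduction runs in shared memory, and one atomicAdd per block accumulates into the row's sum. It assumes d_out is zeroed beforehand and that the block size is a power of two.

__global__ void rowSumAtomic(int M, int N, const float* in, float* out)
{
    extern __shared__ float sdata[];

    int row = blockIdx.y;                              // one row per blockIdx.y
    int col = threadIdx.x + blockIdx.x * blockDim.x;   // position inside the row

    // Coalesced load: consecutive threads read consecutive addresses.
    float val = (row < M && col < N) ? in[(size_t)row * N + col] : 0.0f;

    // Block-level tree reduction in shared memory (blockDim.x must be a power of two).
    sdata[threadIdx.x] = val;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic per block accumulates the partial sums of a row.
    if (threadIdx.x == 0 && row < M)
        atomicAdd(&out[row], sdata[0]);
}

// Possible launch (illustrative):
// dim3 block(256);
// dim3 grid((N + block.x - 1) / block.x, M);
// cudaMemset(d_out, 0, M * sizeof(float));
// rowSumAtomic<<<grid, block, block.x * sizeof(float)>>>(M, N, d_in, d_out);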

Please read the CUDA manual or some tutorials carefully before writing CUDA code. They describe all of this very well and accurately.


UPDATE:

Based on the new information, the problem you are experiencing with the blockSize is likely due to the strided memory accesses in the kernel (more specifically the N*i ). Strided memory access patterns are slow, and generally get slower as the stride gets bigger. In your kernel, each thread accesses a different block of memory. GPUs (and actually most hardware computing units) are optimized for accessing contiguous chunks of data, as previously said. If you want to solve this problem and get faster results, you need to work on the other dimension in parallel (so not M but N), for example as sketched below.
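For example, one simple way to parallelize over N (an illustrative sketch, not the only option) is to give each row its own block: the threads of the block stride over the row with coalesced reads, then reduce their partial sums in shared memory, so no atomics are needed. The kernel name and block size are assumptions for the example, and blockDim.x is assumed to be a power of two.

__global__ void rowSumPerBlock(int M, int N, const float* in, float* out)
{
    extern __shared__ float sdata[];
    int row = blockIdx.x;            // the grid has M blocks, one per row
    if (row >= M) return;

    // Each thread accumulates a strided partial sum over its row (coalesced reads).
    float partial = 0.0f;
    for (int col = threadIdx.x; col < N; col += blockDim.x)
        partial += in[(size_t)row * N + col];

    // Reduce the partial sums within the block (blockDim.x must be a power of two).
    sdata[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[row] = sdata[0];
}

// Possible launch (illustrative): rowSumPerBlock<<<M, 256, 256 * sizeof(float)>>>(M, N, d_in, d_out);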

Furthermore, the BLAS calls are inefficient because each iteration of the loop on the CPU launches a kernel on the GPU. Launching a kernel introduces a fairly large overhead (typically from a few microseconds up to ~100 us). Doing this in a loop executed tens of thousands of times is therefore very slow.
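If you want to stay with cuBLAS, one possible workaround (a sketch, not the only option) is to replace the whole loop with a single cublasSgemv call against a vector of ones, which computes all M row sums in one launch. Note that cublasSasum sums absolute values, so the results only match for non-negative data; d_ones below is assumed to be a device array of N floats set to 1.0f that you prepare beforehand.

const float alpha = 1.0f, beta = 0.0f;
// Seen column-major by cuBLAS, d_in is an N-by-M matrix whose column j is row j
// of your data, so y = A^T * ones gives the M row sums in one call:
cublasStatus = cublasSgemv(cublasHandle, CUBLAS_OP_T,
                           N, M,        // dimensions of A in cuBLAS's column-major view
                           &alpha,
                           d_in, N,     // lda = N
                           d_ones, 1,
                           &beta,
                           d_out, 1);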
