CUDA performance doubts

Since I didn't get a response on the CUDA forum, I'll try here:

After writing a few programs in CUDA, I have started measuring their effective bandwidth. However, I am getting some strange results. For example, in the following code, which sums all the elements of a vector (of any dimension), the bandwidth of the unrolled code and of the "normal" code seems to have the same median (around 3000 Gb/s). I don't know if I am doing something wrong (as far as I can tell, the program works fine), but from what I have read so far, the unrolled code should have a higher bandwidth.

#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <math.h>
#define elements 1000
#define blocksize 16    


__global__ void vecsumkernel(float*input, float*output,int nelements){



    __shared__ float psum[blocksize];
    int tid=threadIdx.x;

    if(tid + blockDim.x * blockIdx.x < nelements)
    psum[tid]=input[tid+blockDim.x*blockIdx.x];
    else
    psum[tid]=0.0f;
    __syncthreads();

    //WITHOUT UNROLL

    int stride;     
    for(stride=blockDim.x/2;stride>0;stride>>=1){
            if(tid<stride)
                    psum[tid]+=psum[tid+stride];
    __syncthreads();
    }
    if(tid==0)
            output[blockIdx.x]=psum[0];


    //WITH UNROLL
 /*
    if(blocksize>=512 && tid<256) psum[tid]+=psum[tid+256];__syncthreads();
    if(blocksize>=256 && tid<128) psum[tid]+=psum[tid+128];__syncthreads();
    if(blocksize>=128 && tid<64) psum[tid]+=psum[tid+64];__syncthreads();


    if (tid < 32) {
            if (blocksize >= 64) psum[tid] += psum[tid + 32];
            if (blocksize >= 32) psum[tid] += psum[tid + 16];
            if (blocksize >= 16) psum[tid] += psum[tid + 8];
            if (blocksize >=  8) psum[tid] += psum[tid + 4];
            if (blocksize >=  4) psum[tid] += psum[tid + 2];
            if (blocksize >=  2) psum[tid] += psum[tid + 1];
    }*/

    if(tid==0)
            output[blockIdx.x]=psum[0];



}

void vecsumv2(float*input, float*output, int nelements){
    dim3 dimBlock(blocksize,1,1);
    int i;

    for(i=((int)ceil((double)(nelements)/(double)blocksize))*blocksize;i>1;i=(int)ceil((double)i/(double)blocksize)){
            dim3 dimGrid((int)ceil((double)i/(double)blocksize),1,1);
            printf("\ni=%d\ndimgrid=%u\n ",i,dimGrid.x);

            vecsumkernel<<<dimGrid,dimBlock>>>(i==((int)ceil((double)(nelements)/(double)blocksize))*blocksize ?input:output,output,i==((int)ceil((double)(nelements)/(double)blocksize))*blocksize ? elements:i);
    }

 }

 void printVec(float*vec,int dim){
    printf("\n{");
    for(int i=0;i<dim;i++)
            printf("%f ",vec[i]);
    printf("}\n");
 }

 int main(){
    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);


    float*input=(float*)malloc(sizeof(float)*(elements));
    for(int i=0;i<elements;i++)
            input[i]=(float) i;


    float*output=(float*)malloc(sizeof(float)*elements);



    float *input_d,*output_d;

    cudaMalloc((void**)&input_d,elements*sizeof(float));

    cudaMalloc((void**)&output_d,elements*sizeof(float));



    cudaMemcpy(input_d,input,elements*sizeof(float),cudaMemcpyHostToDevice);


    cudaEventRecord(evstart,0);

    vecsumv2(input_d,output_d,elements);

    cudaEventRecord(evstop,0);
    cudaEventSynchronize(evstop);
    float time;
    cudaEventElapsedTime(&time,evstart,evstop);
    printf("\ntempo gasto:%f\n",time);
    float Bandwidth=((1000*4*2)/10^9)/time;
    printf("\n Bandwidth:%f Gb/s\n",Bandwidth);


    cudaMemcpy(output,output_d,elements*sizeof(float),cudaMemcpyDeviceToHost);


    cudaFree(input_d);
    cudaFree(output_d);
    printf("soma do vector");
    printVec(output,4);



   }

Your unrolled code has a lot of branching in it. I count ten additional branches. Typically, branching within a warp on a GPU is expensive, because all threads in the warp end up waiting on the branch (divergence).
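One way to remove those runtime branches is to make the block size a template parameter, so every `if (BLOCKSIZE >= N)` is evaluated at compile time and dead branches are stripped from the generated code. This is the technique used in NVIDIA's reduction example; the sketch below is not the asker's exact kernel, and its warp-synchronous tail assumes the older architectures that example targets.

```cuda
// Sketch: BLOCKSIZE is a compile-time constant, so each `if (BLOCKSIZE >= N)`
// below is either kept or removed by the compiler; no runtime branch remains.
template <unsigned int BLOCKSIZE>
__global__ void reduce_unrolled(const float *input, float *output, int n)
{
    __shared__ float psum[BLOCKSIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * BLOCKSIZE + tid;

    psum[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    if (BLOCKSIZE >= 512) { if (tid < 256) psum[tid] += psum[tid + 256]; __syncthreads(); }
    if (BLOCKSIZE >= 256) { if (tid < 128) psum[tid] += psum[tid + 128]; __syncthreads(); }
    if (BLOCKSIZE >= 128) { if (tid <  64) psum[tid] += psum[tid +  64]; __syncthreads(); }

    if (tid < 32) {
        // volatile stops the compiler caching shared-memory reads in registers;
        // on the pre-Volta hardware this pattern targets, the last 32 threads
        // form one warp and run in lockstep, so no barrier is needed here.
        volatile float *v = psum;
        if (BLOCKSIZE >= 64) v[tid] += v[tid + 32];
        if (BLOCKSIZE >= 32) v[tid] += v[tid + 16];
        if (BLOCKSIZE >= 16) v[tid] += v[tid +  8];
        if (BLOCKSIZE >=  8) v[tid] += v[tid +  4];
        if (BLOCKSIZE >=  4) v[tid] += v[tid +  2];
        if (BLOCKSIZE >=  2) v[tid] += v[tid +  1];
    }
    if (tid == 0)
        output[blockIdx.x] = psum[0];
}

// Launch with a block size matching the template argument, e.g.:
// reduce_unrolled<256><<<dimGrid, 256>>>(input_d, output_d, elements);
```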

See here for more info on warp divergence:

http://forums.nvidia.com/index.php?showtopic=74842

Have you tried using a profiler to see what's going on?

3000 Gb/s does not make sense. The maximum PCIe bus speed is 8 GB/s in each direction.

Take a look at the Parallel Prefix Sum paper to gain insight into how to speed up your implementation. Also consider that the Thrust library already has this implemented in its reductions module.

Your not-unrolled code is invalid. For stride < 32, some threads of the same warp enter the for-loop body while the others do not. Therefore, some (but not all) threads of the warp hit the __syncthreads(). The CUDA specification says that when that happens, the behaviour is undefined.

It can happen that the warp gets out of sync: some threads already begin loading the next chunk of data and halt on the next instance of __syncthreads(), while the remaining threads are still stuck in the previous loop.

I am not sure, though, whether that is what you are facing in this particular case.
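For reference, the placement that avoids this issue keeps the barrier outside the tid-dependent branch: since the stride is uniform across the block, every thread executes the same number of loop iterations and therefore reaches every __syncthreads(). A minimal self-contained sketch (not the asker's exact kernel; dynamic shared memory is assumed):

```cuda
// Launch with shared memory of blockDim.x * sizeof(float).
__global__ void reduce_block(const float *input, float *output, int n)
{
    extern __shared__ float psum[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    psum[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            psum[tid] += psum[tid + stride];
        __syncthreads();  // unconditional: reached by the whole block,
                          // never inside the tid-dependent branch above
    }
    if (tid == 0)
        output[blockIdx.x] = psum[0];
}
```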

I see you're doing a reduction sum in your kernel. Here's a good presentation by NVIDIA on optimizing reduction on GPUs. You'll notice that the same code that was giving a throughput of 2 GB/s is optimized to 63 GB/s in that guide.
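One of the first optimizations in those slides is "first add during load": each thread sums two elements while loading from global memory, which halves the number of blocks and removes the half of the threads that would otherwise sit idle during the first reduction step. A sketch under that assumption (not the asker's kernel):

```cuda
// Launch with gridDim.x = ceil(n / (2.0 * blockDim.x)) and shared memory of
// blockDim.x * sizeof(float): each block now covers 2 * blockDim.x elements.
__global__ void reduce_first_add(const float *input, float *output, int n)
{
    extern __shared__ float psum[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;

    // First addition happens during the global load.
    float sum = (i < n) ? input[i] : 0.0f;
    if (i + blockDim.x < n)
        sum += input[i + blockDim.x];
    psum[tid] = sum;
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            psum[tid] += psum[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        output[blockIdx.x] = psum[0];
}
```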
