

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, let us say, two buckets.

The easiest way to do it appears to be:

  1. Load the variables into shared memory
    • Do vectorized loads for unsigned char, etc. if needed.
  2. Do an atomic add in shared memory
  3. Do a coalesced write to global.

Something like this:

__global__ void shmem_atomics_reducer(int *data, int *count){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;

  // per-block histogram in shared memory (assumes NUM_BINS <= NUM_THREADS_PER_BLOCK)
  __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
  block_reduced[threadIdx.x] = 0;

  __syncthreads();

  atomicAdd(&block_reduced[data[tid]], 1);
  __syncthreads();

  // merge the per-block histogram into the global one
  for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_THREADS_PER_BLOCK)
    atomicAdd(&count[i], block_reduced[i]);
}

The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention and shared memory bank conflicts are given as the reasons. I don't know if there is any way to remove either of these for this calculation in any significant way.
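(For reference, a common mitigation for this kind of shared-memory contention, sketched below under the same assumptions as above - NUM_BINS, NUM_WARPS_PER_BLOCK and NUM_THREADS_PER_BLOCK as compile-time constants - and not taken from the question itself, is to replicate the per-block sub-histogram once per warp, so that at most 32 threads contend on any counter, and to merge the copies before the global write.)

__global__ void shmem_replicated_atomics_reducer(const int *data, int *count){
  __shared__ int block_reduced[NUM_WARPS_PER_BLOCK][NUM_BINS];

  uint tid = blockIdx.x*blockDim.x + threadIdx.x;
  uint warp_id = threadIdx.x >> 5;

  // zero all copies cooperatively
  for(int i=threadIdx.x; i<NUM_WARPS_PER_BLOCK*NUM_BINS; i+=NUM_THREADS_PER_BLOCK)
    (&block_reduced[0][0])[i] = 0;
  __syncthreads();

  // each warp updates its own copy, so at most 32 threads contend per counter
  atomicAdd(&block_reduced[warp_id][data[tid]], 1);
  __syncthreads();

  // merge the copies and push the result to global memory
  for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_THREADS_PER_BLOCK){
    int sum = 0;
    for(int w=0; w<NUM_WARPS_PER_BLOCK; w++)
      sum += block_reduced[w][i];
    atomicAdd(&count[i], sum);
  }
}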

I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp level reductions using __ballot to grab warp results and then using __popc() to do the warp level reduction.

__global__ void ballot_popc_reducer(int *data, int *count ){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;
  uint warp_id = threadIdx.x >> 5;

  //need lane_ids since we are going warp level
  uint lane_id = threadIdx.x%32;

  //for ballot
  uint warp_set_bits=0;

  //to store warp level sum
  __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
  //shared data
  __shared__ uint s_data[NUM_THREADS_PER_BLOCK];

  //load shared data - could store to registers
  s_data[threadIdx.x] = data[tid];

  __syncthreads();

  //suspicious loop - I think we need more parallelism
  for(int i=0; i<NUM_BINS; i++){
    warp_set_bits = __ballot(s_data[threadIdx.x]==i);

    if(lane_id==0){
      warp_reduced_count[warp_id] = __popc(warp_set_bits);
    }

    __syncthreads();

    //block-level reduce of the per-warp counts
    //(all threads run the loop so __syncthreads() is not inside divergent code)
    //could use shfl, but it does not change the overall picture
    for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
      if(threadIdx.x < j)
        warp_reduced_count[threadIdx.x] += warp_reduced_count[threadIdx.x+j];
      __syncthreads();
    }

    if(threadIdx.x==0){
      atomicAdd(&count[i], warp_reduced_count[0]);
    }

    __syncthreads();
  }
}

This gives decent numbers (well, that is moot - peak device mem bw is 133 GB/s, and things seem to depend on launch configuration) for the single-bin case (35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics), but performance drops drastically when we increase the number of bins. When we run with 32 bins, performance drops to about 5 GB/s. The reason is perhaps the single thread looping through all the bins, which calls for parallelizing the NUM_BINS loop.

I have tried several ways of parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we would again suffer from contention with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing in the y direction over bins gives similarly uninspiring results; a sketch of that variant follows below.
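(For illustration only, here is a sketch of how the y-direction variant could look - this is a reconstruction, not the code that was actually benchmarked: each blockIdx.y handles one bin, so the per-bin loop disappears at the cost of re-reading the data once per bin.)

__global__ void ballot_popc_bins_y(const int *data, int *count){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;
  uint lane_id = threadIdx.x%32;
  uint warp_id = threadIdx.x >> 5;
  int bin = blockIdx.y;   // one bin per y-slice of the grid

  __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];

  uint warp_set_bits = __ballot(data[tid]==bin);
  if(lane_id==0)
    warp_reduced_count[warp_id] = __popc(warp_set_bits);
  __syncthreads();

  // block-level reduction of the per-warp counts
  for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
    if(threadIdx.x < j)
      warp_reduced_count[threadIdx.x] += warp_reduced_count[threadIdx.x+j];
    __syncthreads();
  }

  if(threadIdx.x==0)
    atomicAdd(&count[bin], (int)warp_reduced_count[0]);
}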

The other idea I tried, just for kicks, was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to there being no real compute work for the child kernels plus the launch overhead.
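(A minimal sketch of the per-bin child-kernel idea, again just for illustration - it assumes compute capability 3.5+ and compilation with -rdc=true, and is not the exact code that was timed.)

__global__ void count_one_bin(const int *data, int *count, int n, int bin){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  // each child grid only counts occurrences of a single bin value
  if(i < n && data[i] == bin)
    atomicAdd(&count[bin], 1);
}

__global__ void per_bin_launcher(const int *data, int *count, int n){
  // one child grid per bin, launched from a single parent thread
  if(blockIdx.x == 0 && threadIdx.x == 0)
    for(int b=0; b<NUM_BINS; b++)
      count_one_bin<<<(n + 255)/256, 256>>>(data, count, n, b);
}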

The most promising approach seems to be, from Nicholas Wilt's article, to use these so-called privatized histograms, containing bins for each thread in shared memory, which would ostensibly be very heavy on shmem usage (and we only have 48 kB per SM on Maxwell).
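(A rough sketch of how such a privatized layout might look - a reading of the idea, not Nicholas Wilt's code. NUM_BINS and NUM_THREADS_PER_BLOCK are assumed compile-time constants, with NUM_BINS small enough that NUM_THREADS_PER_BLOCK*NUM_BINS*sizeof(int) fits in the 48 kB of shared memory.)

__global__ void privatized_hist_reducer(const int *data, int *count, int n){
  // one private set of NUM_BINS counters per thread, laid out bin-major so that
  // the increments below are free of shared memory bank conflicts
  __shared__ int priv[NUM_BINS*NUM_THREADS_PER_BLOCK];

  for(int b=0; b<NUM_BINS; b++)
    priv[b*NUM_THREADS_PER_BLOCK + threadIdx.x] = 0;
  __syncthreads();

  // grid-stride loop: no atomics at all while counting
  for(int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    priv[data[i]*NUM_THREADS_PER_BLOCK + threadIdx.x]++;
  __syncthreads();

  // merge the private copies, one bin per thread (assumes NUM_BINS <= NUM_THREADS_PER_BLOCK)
  for(int b=threadIdx.x; b<NUM_BINS; b+=NUM_THREADS_PER_BLOCK){
    int sum = 0;
    for(int t=0; t<NUM_THREADS_PER_BLOCK; t++)
      sum += priv[b*NUM_THREADS_PER_BLOCK + t];
    atomicAdd(&count[b], sum);
  }
}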

Perhaps someone could shed some insight into the problem? I feel that one ought to change the algorithm instead, so as not to use histograms and to use something less frequentist. Otherwise, I suppose we just use the atomics version.

Edit: The context for my problem is in computing probability density functions to be used for pattern classification. We can compute approximate histograms (more precisely, pdfs) by using non-parametric methods such as Parzen Windows or Kernel Density Estimation. However, this does not overcome the problem of dimensionality, as we need to sum over all data points for every bin, which becomes expensive when the number of bins becomes large. See here: Parzen
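For reference, the standard 1-D Parzen/KDE estimate (textbook form, not taken from the post) is

\hat{p}(x) = \frac{1}{N h} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),

so evaluating it at M bin centres over N samples costs O(M*N), which is exactly the scaling problem mentioned above.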

I faced similar challenges when working with clustering, but in the end the best solution was to use the scan pattern to group the processing. So I don't think it would work for you. Since you asked for some experience in this area, I'll share mine with you.

The issues

In your 1st code, I guess that the low performance when the number of bins is reduced is linked to warp stalls, since you perform very little processing for every evaluated data element. When the number of bins is increased, the ratio of processing to global memory loads (data info) for that kernel also increases. You can check that very easily with the "Issue Efficiency" experiments in the Performance Analysis from Nsight. Probably you are getting a low rate of cycles with at least one eligible warp (Warp Issue Efficiency).

Since I was not able to improve the number of eligible warps to somewhere close to 95%, I gave up this approach, since for some cases it got worse (memory dependency stalled 90% of my processing cycles).

[Nsight screenshot: cycles with no eligible warp]

The shuffle and vote reduction is very useful if the number of bins is not too large. If it is too large, only a small number of threads will be active for every bin filter. So you may end up with a lot of code divergence, and that is very undesirable for parallel processing. You may try to group the divergence in order to remove branching and have a good control flow, so that the whole warp/block performs similar processing, but with a lot of variation across blocks.
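For completeness, the shuffle reduction I mean is the standard pattern below (a sketch only; on CUDA 9 and later the _sync variants with an explicit mask are required):

__inline__ __device__ int warp_reduce_sum(int val){
  // tree reduction within a warp using register shuffles
  for(int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down(val, offset);  // __shfl_down_sync(0xffffffff, val, offset) on CUDA 9+
  return val;                         // lane 0 ends up with the warp sum
}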


A feasible solution

I don't know where, but I have seen very good solutions to your problem around. Did you try this one?

Also, you can use a vectorized load and try something like this, but I'm not sure how much it would improve your performance:

__global__ void hist(int4 *data, int *count, int N, int rem, unsigned int init) {

    __shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dynamically
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0;
    __syncthreads();

    for (int i = 0; i < N; i += warpSize) {
        atomicAdd(&sBins[data[i + init].w], 1);
        atomicAdd(&sBins[data[i + init].x], 1);
        atomicAdd(&sBins[data[i + init].y], 1);
        atomicAdd(&sBins[data[i + init].z], 1);
    }

    // process remaining elements if the data is not a multiple of 4,
    // using a recast and an additional control
    for (int i = 0; i < rem; i++) {
        atomicAdd(&sBins[reinterpret_cast<int*>(data)[N * 4 + init + i]], 1);
    }
    __syncthreads();

    // update your (global) histogram data here
}
