
About the compact operation in cudpp

The following kernel function is the compact operation in cudpp, a CUDA library (http://gpgpu.org/developer/cudpp).

My question is: why does the developer repeat the writing part 8 times, and why does that improve performance?

And why does one thread process 8 elements, instead of each thread processing one element?

template <class T, bool isBackward>
__global__ void compactData(T                  *d_out, 
                        size_t             *d_numValidElements,
                        const unsigned int *d_indices, // Exclusive Sum-Scan Result
                        const unsigned int *d_isValid,
                        const T            *d_in,
                        unsigned int       numElements)
{
   if (threadIdx.x == 0)
   {
       if (isBackward)
           d_numValidElements[0] = d_isValid[0] + d_indices[0];
       else
           d_numValidElements[0] = d_isValid[numElements-1] + d_indices[numElements-1];
   }

   // The index of the first element (in a set of eight) that this
   // thread is going to process. We left-shift blockDim.x by 3
   // (multiply by 8) since each block of threads processes eight
   // times the number of threads in that block.
   unsigned int iGlobal = blockIdx.x * (blockDim.x << 3) + threadIdx.x;

   // Repeat the following 8 (SCAN_ELTS_PER_THREAD) times:
   // 1. Check whether the flag for this element in d_isValid is set
   // 2. If not, do nothing
   // 3. If set, write the element from d_in to d_out at the
   //    position specified by d_indices
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
   iGlobal += blockDim.x;  
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];       
   }
   iGlobal += blockDim.x;
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
   iGlobal += blockDim.x;
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
   iGlobal += blockDim.x;
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
   iGlobal += blockDim.x;
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
   iGlobal += blockDim.x;
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
   iGlobal += blockDim.x;
   if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
       d_out[d_indices[iGlobal]] = d_in[iGlobal];
   }
}

My question is: why does the developer repeat the writing part 8 times, and why does that improve performance?

As @torrential_coding stated, loop unrolling can help performance, particularly in a case like this where the loop is very tight (it has little logic in it). However, the coder should have used CUDA's support for automatic loop unrolling instead of doing it manually.
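As a sketch of that suggestion (this rewrite is not from the question or the CUDPP sources), the eight hand-written copies could be collapsed into a counted loop. Because the trip count is a compile-time constant, `#pragma unroll` lets nvcc emit the same eight straight-line copies automatically:

```cuda
// Hypothetical rewrite of the body of compactData: the compiler
// unrolls the loop, producing the same code as the manual version.
#pragma unroll
for (unsigned int i = 0; i < 8; ++i)  // 8 == SCAN_ELTS_PER_THREAD
{
    if (iGlobal < numElements && d_isValid[iGlobal] > 0) {
        d_out[d_indices[iGlobal]] = d_in[iGlobal];
    }
    iGlobal += blockDim.x;
}
```

This keeps the performance benefit of unrolling (no loop counter, no branch back to the loop head) while remaining far easier to maintain than eight copied blocks.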

And why does one thread process 8 elements, instead of each thread processing one element?

There might be a small performance gain from computing the full index `iGlobal` and checking `threadIdx.x` against zero only once per 8 operations instead of once per operation, which is what would happen if each thread processed only one element.
