Finding the maximum of regions of unknown size in an array using CUDA

Question

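Suppose I have an array of values, A[4000], containing all different numbers: [45,21,764,234,7,0,12,55,...]

Then I have another array, B[4000], that marks the locations of regions in array A: an element is 1 if that position is part of a region and 0 if it is not. If 1's are next to each other, they are part of the same region; if they are not next to each other (there is a 0 between the 1's), they are part of different regions.

For example, B = [1,1,1,0,1,1,0,0...] means I want the maximum of the region made up of the first three numbers in array A, the maximum of the region made up of the 5th and 6th numbers in array A, and so on. From that I can produce an array C[4000] that holds the maximum of A for each region denoted by B, and a 0 in positions that are not part of a region.

So in this case C = [764,764,764,0,7,7,0,0...]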
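To pin down the expected output, the mapping from A and B to C can be written as a plain sequential loop. This is a minimal CPU sketch in C++, useful only as a reference for checking a GPU version; the function name `region_max_reference` is illustrative and not part of the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential CPU reference (hypothetical helper): for every run of
// consecutive 1s in B, write the maximum of the matching A values into C.
// Positions where B is 0 stay 0.
std::vector<int> region_max_reference(const std::vector<int>& A,
                                      const std::vector<int>& B) {
    std::vector<int> C(A.size(), 0);
    std::size_t i = 0;
    while (i < A.size()) {
        if (B[i] == 0) { ++i; continue; }
        // [i, j) is one region: a maximal run of nonzero entries in B.
        std::size_t j = i;
        int best = A[i];
        while (j < A.size() && B[j] != 0) {
            best = std::max(best, A[j]);
            ++j;
        }
        std::fill(C.begin() + i, C.begin() + j, best);
        i = j;  // continue after the end of this region
    }
    return C;
}
```

Calling `region_max_reference({45,21,764,234,7,0,12,55}, {1,1,1,0,1,1,0,0})` gives the C vector shown above.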

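There can be anywhere from 0 to 2,000 regions, and a region can be 2 to 4,000 numbers long. I never know in advance how many regions there are or how large each one is.

I have been trying to come up with a kernel in CUDA that achieves this result. It needs to be as fast as possible, since it will actually be used on images; this is just a simplified example. All my ideas, such as using a reduction, only work if there is a single region spanning all 4000 numbers of array A. I don't think I can use a reduction here, because the array can contain multiple regions separated by 1 to 3996 spaces (0's), and a reduction would make me lose track of the separate regions. Otherwise, the kernel has too many loops and if statements in it to be fast, for example: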

// One thread per element; idx is this thread's position in A, B and C.
// Note: there are no bounds checks, so idx + intR and idx - intL can run
// past either end of the arrays. Starting the maxima at 0 assumes A holds
// non-negative values, as in the example.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int intMaxR = 0, intMaxL = 0;

// Walk right from idx to the end of the region.
int intR = 0;
while(B[idx + intR] > 0){
    intMaxR = intMaxR < A[idx + intR] ? A[idx + intR] : intMaxR;
    intR++;
}

// Walk left from idx to the start of the region.
int intL = 0;
while(B[idx - intL] > 0){
    intMaxL = intMaxL < A[idx - intL] ? A[idx - intL] : intMaxL;
    intL++;
}

int intMax = intMaxR > intMaxL ? intMaxR : intMaxL;

// Every thread in the region rewrites the whole region.
for(int i = 0; i < intR; i++){
    C[idx + i] = intMax;
}
for(int i = 0; i < intL; i++){
    C[idx - i] = intMax;
}

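Obviously that code, even with shared memory, is slow and doesn't really take advantage of CUDA's parallel nature. Does anyone have an idea of how, or whether, this can be done efficiently in CUDA?

Thanks in advance.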

Answer 1

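One possible approach is to use Thrust.

A possible sequence would be as follows:

1. Use thrust::reduce_by_key to produce the maximum value for each range.
2. Use thrust::adjacent_difference to delineate the start of each range.
3. Use an inclusive scan on the result of step 2 to produce a gather index, i.e. the index that selects which reduced value (from step 1) goes in each position of the output vector.
4. Use thrust::gather_if to selectively place the reduced values into the appropriate positions of the output vector (where B is 1), using the gather indices produced in step 3.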
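To make the intermediate vectors concrete, the four steps can be traced on the CPU with the C++ standard library. This is a sequential sketch of the same data flow, not the Thrust implementation; the name `region_max_steps` is hypothetical, and the code assumes B contains only 0s and 1s. Subtracting `B[0]` from the scan result turns it into a 0-based index regardless of whether B starts with a 0 or a 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <numeric>
#include <vector>

// CPU trace of the four steps (hypothetical helper, assumes B is 0/1 only).
std::vector<int> region_max_steps(const std::vector<int>& A,
                                  const std::vector<int>& B) {
    const std::size_t n = A.size();
    std::vector<int> C(n, 0);
    if (n == 0) return C;

    // Step 1: one maximum per run of equal keys in B (runs of 0s included).
    std::vector<int> red;
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || B[i] != B[i - 1]) red.push_back(A[i]);
        else red.back() = std::max(red.back(), A[i]);
    }

    // Step 2: adjacent difference; a nonzero entry marks the start of a run.
    // The first element is copied through unchanged, so idx[0] == B[0].
    std::vector<int> idx(n);
    std::adjacent_difference(B.begin(), B.end(), idx.begin());

    // Step 3: inclusive scan of |difference| numbers the runs.
    for (int& d : idx) d = std::abs(d);
    std::partial_sum(idx.begin(), idx.end(), idx.begin());

    // Step 4: gather the reduced value only where B is 1.
    for (std::size_t i = 0; i < n; ++i)
        if (B[i] != 0) C[i] = red[idx[i] - B[0]];
    return C;
}
```

For A = {45,21,764,234,7,0,12,55} and B = {1,1,1,0,1,1,0,0}, step 1 yields {764,234,7,55}, steps 2 and 3 yield {1,1,1,2,3,3,4,4}, and step 4 picks entries 0 and 2 of the reduced vector for the two regions.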

Here is a fully worked code that demonstrates this, using the A and B vectors from your example:

#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/gather.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>

#define DSIZE 8

template <typename T>
struct abs_val : public thrust::unary_function<T, T>
{
  __host__ __device__
  T operator()(const T& x) const
  {
    if (x<0) return -x;
    else return x;
  }
};

template <typename T>
struct subtr : public thrust::unary_function<T, T>
{
  const T val;
  subtr(T _val): val(_val) {}
  __host__ __device__
  T operator()(const T& x) const
  {
    return  x-val;
  }
};

int main(){

  int A[DSIZE] = {45,21,764,234,7,0,12,55};
  int B[DSIZE] = {1,1,1,0,1,1,0,0};
  thrust::device_vector<int> dA(A, A+DSIZE);
  thrust::device_vector<int> dB(B, B+DSIZE);
  thrust::device_vector<int> dRed(DSIZE);
  thrust::device_vector<int> diffB(DSIZE);
  thrust::device_vector<int> dRes(DSIZE);

  thrust::reduce_by_key(dB.begin(), dB.end(), dA.begin(), thrust::make_discard_iterator(), dRed.begin(), thrust::equal_to<int>(), thrust::maximum<int>());
  thrust::adjacent_difference(dB.begin(), dB.end(), diffB.begin());
  thrust::transform_inclusive_scan(diffB.begin(), diffB.end(), diffB.begin(), abs_val<int>(), thrust::plus<int>());
  thrust::gather_if(thrust::make_transform_iterator(diffB.begin(), subtr<int>(B[0])), thrust::make_transform_iterator(diffB.end(), subtr<int>(B[0])), dB.begin(), dRed.begin(), dRes.begin());
  thrust::copy(dRes.begin(), dRes.end(), std::ostream_iterator<int>(std::cout, " "));
  std::cout  << std::endl;
  return 0;
}

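Notes on the example:

- The reduce_by_key generates a reduced value (the maximum) for each contiguous sequence of 0's or 1's in B. You really only need the maximums of the 1-sequences. The maximums of the 0-sequences are discarded by the gather_if function.
- I account for the possibility that the B vector may start with either a 1-sequence or a 0-sequence by wrapping the gather-index vector (diffB, written in step 2 and scanned in place in step 3) in a transform_iterator that subtracts the first value of the B vector from each gather index.
- The adjacent_difference operation produces either a 1 or a -1 to delineate the start of a new sequence. I use the transform_inclusive_scan variant of inclusive_scan together with the abs_val functor to treat these two values equally for the purposes of the scan (i.e. generating the gather indices).

The above code should produce a result that matches your desired C output vector, like this: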
```
$ nvcc -arch=sm_20 -o t53 t53.cu
$ ./t53
764 764 764 0 7 7 0 0
$
```
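The second note, on subtracting the first value of B, is easiest to see by computing the gather indices for a B that starts with 1 and for one that starts with 0. The following is a small host-side C++ sketch; `run_indices` is a hypothetical helper that fuses the adjacent difference, the scan of absolute values, and the subtraction into one loop, and it assumes B holds only 0s and 1s.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// For each position, returns the 0-based index of the run of equal values
// it belongs to (hypothetical helper, assumes B is 0/1 only).
std::vector<int> run_indices(const std::vector<int>& B) {
    std::vector<int> idx(B.size());
    int running = 0;
    for (std::size_t i = 0; i < B.size(); ++i) {
        // Adjacent difference: the first element passes through unchanged.
        const int diff = (i == 0) ? B[0] : B[i] - B[i - 1];
        running += std::abs(diff);  // inclusive scan of |diff|
        idx[i] = running - B[0];    // shift so the first run is always 0
    }
    return idx;
}
```

Without the `- B[0]` shift, a B that starts with 1 would number its first run 1 while a B that starts with 0 would number its first run 0, so the same index could not be used directly into the reduced vector in both cases.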

We can use thrust::placeholders to simplify the above code further, eliminating the need for the extra functor definitions:

#include <cstdlib>
#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/gather.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>

#define DSIZE 2000000
using namespace thrust::placeholders;

typedef int mytype;

int main(){

  mytype *A = (mytype *)malloc(DSIZE*sizeof(mytype));
  int *B = (int *)malloc(DSIZE*sizeof(int));
  for (int i = 0; i < DSIZE; i++){
    A[i] = (rand()/(float)RAND_MAX)*10.0f;
    B[i] = rand()%2;}
  thrust::device_vector<mytype> dA(A, A+DSIZE);
  thrust::device_vector<int> dB(B, B+DSIZE);
  thrust::device_vector<mytype> dRed(DSIZE);
  thrust::device_vector<int> diffB(DSIZE);
  thrust::device_vector<mytype> dRes(DSIZE);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  // The keys come from dB, which is always int, so compare them as int even if mytype changes.
  thrust::reduce_by_key(dB.begin(), dB.end(), dA.begin(), thrust::make_discard_iterator(), dRed.begin(), thrust::equal_to<int>(), thrust::maximum<mytype>());
  thrust::adjacent_difference(dB.begin(), dB.end(), diffB.begin());
  thrust::transform_inclusive_scan(diffB.begin(), diffB.end(), diffB.begin(), _1*_1, thrust::plus<int>());
  thrust::gather_if(thrust::make_transform_iterator(diffB.begin(), _1 - B[0]), thrust::make_transform_iterator(diffB.end(), _1 - B[0]), dB.begin(), dRed.begin(), dRes.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  std::cout<< "elapsed time: " << et << "ms " << std::endl;
  thrust::copy(dRes.begin(), dRes.begin()+10, std::ostream_iterator<mytype>(std::cout, " "));
  std::cout  << std::endl;
  return 0;
}

(I have modified the placeholders code above to also generate a larger data set and to include some basic timing apparatus.)
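One detail of the placeholder version is that `_1*_1` stands in for the abs_val functor. That works because the adjacent differences of a 0/1 vector can only be -1, 0, or 1, and squaring agrees with the absolute value on exactly those inputs. A tiny host-side C++ check of that claim, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstdlib>

// What the placeholder expression _1*_1 computes for one element.
inline int square(int x) { return x * x; }

// True when squaring and taking the absolute value give the same result.
inline bool square_matches_abs(int x) { return square(x) == std::abs(x); }
```

The equivalence does not hold for other values (for example 2), so the substitution relies on B containing only 0s and 1s.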

Answer 1
2 Accepted 2014-09-01 15:01:27
