Find Maximum Value of Regions of Unknown size in an Array using CUDA

Say I have an array A[4000] of values that are all different, e.g. [45,21,764,234,7,0,12,55,...].

Then I have another array B[4000] that marks the location of regions in array A: an entry is 1 if the corresponding element is part of a region, and 0 if it is not. If the 1's are next to each other, they are part of the same region; if they are not next to each other (there is a 0 in between the 1's), they are part of different regions.

For example, B = [1,1,1,0,1,1,0,0...] means that I want to find the maximum value among the first three numbers in array A, the maximum among the 5th and 6th numbers in array A, etc., so that I can produce an array C[4000] that holds the maximum value of A in each of the regions denoted by B, and a 0 in the areas that are not part of any region.

So in this case C = [764,764,764,0,7,7,0,0...]

There can be anywhere from 0 to 2,000 regions, and each region can be from 2 to 4,000 numbers long. I never know beforehand how many regions there are or how large each one is.
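To pin down the expected output, here is a minimal sequential sketch on the host. It is not a CUDA kernel and makes no attempt at speed; the function name `region_max_reference` is made up for illustration, and it assumes A and B have the same length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side reference (not a CUDA kernel): for every run of
// consecutive non-zero entries in B, write the maximum of the matching
// slice of A into C. Positions where B is 0 are left at 0.
std::vector<int> region_max_reference(const std::vector<int>& A,
                                      const std::vector<int>& B) {
  assert(A.size() == B.size());
  std::vector<int> C(A.size(), 0);
  std::size_t i = 0;
  while (i < B.size()) {
    if (B[i] == 0) { ++i; continue; }
    // [i, j) is one region: extend j while B stays non-zero.
    std::size_t j = i;
    while (j < B.size() && B[j] != 0) ++j;
    const int m = *std::max_element(A.begin() + i, A.begin() + j);
    std::fill(C.begin() + i, C.begin() + j, m);
    i = j;
  }
  return C;
}
```

Calling it with the A and B from the example gives the C shown above, which makes it usable as a correctness check for any GPU version.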

I have been trying to come up with a kernel in CUDA that can achieve this result. It needs to be as fast as possible since in reality it will be used for images; this is just a simplified example. All of my ideas, such as using reduction, only work if a single region spans all 4000 numbers of array A. I do not think that I can use reduction here, because there can be multiple regions in the array separated by 1 to 3996 spaces (0's), and reduction would cause me to lose track of the separate regions. Otherwise, the kernel has far too many loops and if statements in it to be fast, such as

int intR = 0;
 while(B[blockIdx.x * blockDim.x + threadIdx.x + intR] > 0){
     intMaxR = intMaxR < A[blockIdx.x * blockDim.x + threadIdx.x + intR] ? A[blockIdx.x * blockDim.x + threadIdx.x + intR] : intMaxR;
     intR++;
 }

 int intL = 0;
 while(B[blockIdx.x * blockDim.x + threadIdx.x - intL] > 0){
     intMaxL = intMaxL < A[blockIdx.x * blockDim.x + threadIdx.x - intL] ? A[blockIdx.x * blockDim.x + threadIdx.x - intL] : intMaxL;
     intL++;
 }

 intMax =  intMaxR > intMaxL ? intMaxR : intMaxL;

 for(int i = 0; i < intR; i++){
     C[blockIdx.x * blockDim.x + threadIdx.x + i] = intMax;
 }
 for(int i = 0; i < intL; i++){
     C[blockIdx.x * blockDim.x + threadIdx.x - i] = intMax;
 }

Clearly the code is slow even with shared memory, and isn't really taking advantage of the parallel nature of CUDA. Does anyone have any idea how, or whether, this can be done efficiently in CUDA?

Thanks in advance.

One possible approach would be to use thrust.

A possible sequence would be like this:

  1. Use thrust::reduce_by_key to generate the max values for each range.
  2. Use thrust::adjacent_difference to delineate the start of each range.
  3. Use an inclusive scan on the results of step 2 to generate the gather indices, i.e. the indices that will be used to select the reduced value (results from step 1) that will go in each location of the output vector.
  4. Use thrust::gather_if to selectively place the reduced values into the appropriate locations in the output vector (where there is a 1 in the B vector), using the gather indices generated in step 3.
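The four steps can be traced on the host before moving to the GPU. The sketch below uses standard-library algorithms in place of the Thrust calls, so it only illustrates the data flow and says nothing about performance; `region_max_steps` is a made-up name, and the code assumes B is non-empty and contains only 0s and 1s.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <numeric>
#include <vector>

// Host-side trace of the four steps, using the standard library in place of
// Thrust. Assumes B is non-empty and contains only 0s and 1s.
std::vector<int> region_max_steps(const std::vector<int>& A,
                                  const std::vector<int>& B) {
  assert(!B.empty() && A.size() == B.size());
  const std::size_t n = B.size();

  // Step 1 (reduce_by_key): one maximum per run of equal keys in B,
  // for 0-runs as well as 1-runs.
  std::vector<int> red;
  for (std::size_t i = 0; i < n; ++i) {
    if (i == 0 || B[i] != B[i - 1]) red.push_back(A[i]);
    else if (A[i] > red.back()) red.back() = A[i];
  }

  // Step 2 (adjacent_difference): element 0 is a copy of B[0]; every later
  // element is B[i] - B[i-1], i.e. +1 or -1 where a new run starts.
  std::vector<int> idx(n);
  std::adjacent_difference(B.begin(), B.end(), idx.begin());

  // Step 3 (inclusive scan of absolute values): running count of run starts.
  for (int& d : idx) d = std::abs(d);
  std::partial_sum(idx.begin(), idx.end(), idx.begin());

  // Step 4 (gather_if): subtract B[0] so the first run has index 0, then
  // copy red[index] wherever B is 1 and leave 0 elsewhere.
  std::vector<int> C(n, 0);
  for (std::size_t i = 0; i < n; ++i)
    if (B[i] != 0) C[i] = red[idx[i] - B[0]];
  return C;
}
```

For the example input, step 1 yields the per-run maxima [764, 234, 7, 55], and steps 2 and 3 (after subtracting B[0]) yield the gather indices [0, 0, 0, 1, 2, 2, 3, 3]; the maxima of the 0-runs (234 and 55) are never gathered.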

Here's a fully worked code demonstrating this, using A and B vectors like your example:

#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/gather.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>

#define DSIZE 8

template <typename T>
struct abs_val : public thrust::unary_function<T, T>
{
  __host__ __device__
  T operator()(const T& x) const
  {
    if (x<0) return -x;
    else return x;
  }
};

template <typename T>
struct subtr : public thrust::unary_function<T, T>
{
  const T val;
  subtr(T _val): val(_val) {}
  __host__ __device__
  T operator()(const T& x) const
  {
    return  x-val;
  }
};

int main(){

  int A[DSIZE] = {45,21,764,234,7,0,12,55};
  int B[DSIZE] = {1,1,1,0,1,1,0,0};
  thrust::device_vector<int> dA(A, A+DSIZE);
  thrust::device_vector<int> dB(B, B+DSIZE);
  thrust::device_vector<int> dRed(DSIZE);
  thrust::device_vector<int> diffB(DSIZE);
  thrust::device_vector<int> dRes(DSIZE);

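  // Step 1: maximum of A over each run of equal keys in B (0-runs and 1-runs alike); the output keys are discarded.
  thrust::reduce_by_key(dB.begin(), dB.end(), dA.begin(), thrust::make_discard_iterator(), dRed.begin(), thrust::equal_to<int>(), thrust::maximum<int>());
  // Step 2: diffB[0] = B[0]; later elements are +1 or -1 where a new run starts, 0 otherwise.
  thrust::adjacent_difference(dB.begin(), dB.end(), diffB.begin());
  // Step 3: inclusive scan of |diffB| turns the run-start markers into a per-element run counter.
  thrust::transform_inclusive_scan(diffB.begin(), diffB.end(), diffB.begin(), abs_val<int>(), thrust::plus<int>());
  // Step 4: where B is 1, copy dRed[counter - B[0]] into dRes; other elements of dRes keep their initial 0.
  thrust::gather_if(thrust::make_transform_iterator(diffB.begin(), subtr<int>(B[0])), thrust::make_transform_iterator(diffB.end(), subtr<int>(B[0])), dB.begin(), dRed.begin(), dRes.begin());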
  thrust::copy(dRes.begin(), dRes.end(), std::ostream_iterator<int>(std::cout, " "));
  std::cout  << std::endl;
  return 0;
}

Notes about the example:

  1. reduce_by_key generates reduced values (maximums) for each consecutive 0 sequence or 1 sequence in B. You only really need the maximums for the 1 sequences. We discard the 0 sequence maximums via the gather_if function.
  2. I allow for the possibility that the B vector may start with either a 1 sequence or a 0 sequence by applying a transform_iterator to the scanned result of steps 2 and 3, subtracting the first value of the B vector from each gather index.
  3. The adjacent_difference operation will produce either a 1 or -1 to delineate the start of a new sequence. I use the transform_inclusive_scan variant with the abs_val functor to treat these equally for scan purposes (i.e. generation of gather indices).
  4. The above code should produce results matching your desired C output vector, like this:

     $ nvcc -arch=sm_20 -o t53 t53.cu
     $ ./t53
     764 764 764 0 7 7 0 0
     $

We can use thrust::placeholders to further simplify the above code, eliminating the need for the extra functor definitions:

#include <cstdlib>
#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/gather.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>

#define DSIZE 2000000
using namespace thrust::placeholders;

typedef int mytype;

int main(){

  mytype *A = (mytype *)malloc(DSIZE*sizeof(mytype));
  int *B = (int *)malloc(DSIZE*sizeof(int));
  for (int i = 0; i < DSIZE; i++){
    A[i] = (rand()/(float)RAND_MAX)*10.0f;
    B[i] = rand()%2;}
  thrust::device_vector<mytype> dA(A, A+DSIZE);
  thrust::device_vector<int> dB(B, B+DSIZE);
  thrust::device_vector<mytype> dRed(DSIZE);
  thrust::device_vector<int> diffB(DSIZE);
  thrust::device_vector<mytype> dRes(DSIZE);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  thrust::reduce_by_key(dB.begin(), dB.end(), dA.begin(), thrust::make_discard_iterator(), dRed.begin(), thrust::equal_to<int>(), thrust::maximum<mytype>());
  thrust::adjacent_difference(dB.begin(), dB.end(), diffB.begin());
  thrust::transform_inclusive_scan(diffB.begin(), diffB.end(), diffB.begin(), _1*_1, thrust::plus<int>());
  thrust::gather_if(thrust::make_transform_iterator(diffB.begin(), _1 - B[0]), thrust::make_transform_iterator(diffB.end(), _1 - B[0]), dB.begin(), dRed.begin(), dRes.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  std::cout<< "elapsed time: " << et << "ms " << std::endl;
  thrust::copy(dRes.begin(), dRes.begin()+10, std::ostream_iterator<mytype>(std::cout, " "));
  std::cout  << std::endl;
  return 0;
}

(I've modified the above placeholders code to also generate a larger data set and to include some basic timing code.)
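In the placeholders version, `_1*_1` takes the place of the abs_val functor and `_1 - B[0]` takes the place of subtr. Squaring works because the differences are only ever -1, 0 or 1, and the subtraction is needed because adjacent_difference copies B[0] into the first output element. A small host-side sketch of just that index computation, using the standard library rather than Thrust (`run_indices` is a made-up helper name, and B is assumed non-empty with values 0 or 1):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: per-element run index, computed the same way as the
// gather indices in the placeholders version. Assumes B is non-empty and
// contains only 0s and 1s.
std::vector<int> run_indices(const std::vector<int>& B) {
  assert(!B.empty());
  std::vector<int> idx(B.size());
  // idx[0] is a copy of B[0]; later entries are -1, 0 or +1.
  std::adjacent_difference(B.begin(), B.end(), idx.begin());
  // d*d equals |d| for d in {-1, 0, 1}, so squaring also marks run starts.
  for (int& d : idx) d = d * d;
  std::partial_sum(idx.begin(), idx.end(), idx.begin());
  // Without this, a B that starts with 1 would number its first run 1.
  for (int& d : idx) d -= B[0];
  return idx;
}
```

With B = [1,1,1,0,1,1,0,0] this returns [0,0,0,1,2,2,3,3], and with B = [0,0,1,1,0,1] it returns [0,0,1,1,2,3], so the first run is numbered 0 in both cases.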
