Find Maximum Value of Regions of Unknown Size in an Array using CUDA
Say I have an array A[4000] of values that contains all different numbers [45,21,764,234,7,0,12,55,...]
Then I have another array B[4000] that denotes the location of regions in array A, with the number 1 if an element is part of a region and 0 if it is not. If the 1's are next to each other, they are part of the same region; if they are not next to each other (there is a 0 between the 1's), they are part of different regions.
For example, B = [1,1,1,0,1,1,0,0...] means that I want to find the maximum value among the first three numbers in array A, the maximum of the 5th and 6th numbers in array A, etc., so that I can produce an array C[4000] that holds the maximum value of A in each of the regions denoted by B, and a 0 in the areas that are not part of any region.
So in this case C = [764,764,764,0,7,7,0,0...]
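For checking any GPU result, a plain sequential reference is useful. The following is a minimal host-side sketch, not part of the original post; the function name `region_max_reference` is made up for illustration, and it assumes A and B have the same length and B holds only 0's and 1's.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential reference: for every run of consecutive 1's in B, write the
// maximum of the matching elements of A into C. Positions where B is 0
// keep the value 0.
std::vector<int> region_max_reference(const std::vector<int>& A,
                                      const std::vector<int>& B) {
    std::vector<int> C(A.size(), 0);
    std::size_t i = 0;
    while (i < A.size()) {
        if (B[i] == 0) {  // not inside a region
            ++i;
            continue;
        }
        // [i, j) is one region: extend j while B stays non-zero.
        std::size_t j = i;
        int best = A[i];
        while (j < A.size() && B[j] != 0) {
            best = std::max(best, A[j]);
            ++j;
        }
        std::fill(C.begin() + i, C.begin() + j, best);
        i = j;  // continue after the end of this region
    }
    return C;
}
```

With the first eight elements of the example, A = {45,21,764,234,7,0,12,55} and B = {1,1,1,0,1,1,0,0}, this returns {764,764,764,0,7,7,0,0}.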
There can be anywhere from 0 to 2,000 regions, and the length of the regions can range from 2 to 4,000 numbers long. I never know beforehand how many regions there are or the different sizes of the regions.
I have been trying to come up with a kernel in CUDA that can achieve this result. It needs to be as fast as possible, since in reality it will be used for images; this is just a simplified example. All of my ideas, such as using reduction, only work if there is only one region that spans all 4000 numbers of array A. However, I do not think that I can use reduction here, because there can be multiple regions in the array separated by 1 to 3996 spaces (0's), and reduction will cause me to lose track of the separated regions. Or, the kernel has far too many loops and if statements in it to be fast, such as:
// Fragment of a kernel body; A, B and C are device arrays of 4000 ints.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= 4000) return;
int intMaxR = A[idx], intMaxL = A[idx];
// Walk right from idx while still inside the region.
int intR = 0;
while (idx + intR < 4000 && B[idx + intR] > 0) {
    intMaxR = intMaxR < A[idx + intR] ? A[idx + intR] : intMaxR;
    intR++;
}
// Walk left from idx while still inside the region.
int intL = 0;
while (idx - intL >= 0 && B[idx - intL] > 0) {
    intMaxL = intMaxL < A[idx - intL] ? A[idx - intL] : intMaxL;
    intL++;
}
int intMax = intMaxR > intMaxL ? intMaxR : intMaxL;
// Every thread rewrites its whole region, so the work is highly redundant.
for (int i = 0; i < intR; i++) {
    C[idx + i] = intMax;
}
for (int i = 0; i < intL; i++) {
    C[idx - i] = intMax;
}
Clearly the code is slow even with shared memory, and isn't really taking advantage of the parallel nature of CUDA. Does anyone have any idea on how or if this can be done efficiently in CUDA?
Thanks in advance.
One possible approach would be to use thrust.
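A possible sequence would be like this (the steps correspond to the four thrust calls in the code below):
1. Use thrust::reduce_by_key, with B as the keys, A as the values and thrust::maximum as the reduction, to produce one maximum per run of equal keys. This yields a maximum for the runs of 0's as well as the runs of 1's.
2. Use thrust::adjacent_difference on B to mark where each run starts (the result is non-zero wherever B changes value).
3. Use thrust::transform_inclusive_scan on the result of step 2, taking the absolute value first, to turn the boundary markers into a run index for each element.
4. Use thrust::gather_if, with the run indices from step 3 as the map and B as the stencil, to copy each run's maximum into every position where B is 1.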
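The sequence can be traced on the host before looking at the GPU version. The sketch below is only an illustration under stated assumptions: it uses standard-library algorithms as stand-ins for the Thrust calls named in the comments, the function name `region_max_by_steps` is hypothetical, and B is assumed to hold only 0's and 1's and to match A in length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <numeric>
#include <vector>

// Host-side trace of the four steps. Each loop or call stands in for the
// Thrust call named in its comment; this is not the GPU code itself.
std::vector<int> region_max_by_steps(const std::vector<int>& A,
                                     const std::vector<int>& B) {
    const std::size_t n = A.size();

    // Step 1 (reduce_by_key): one maximum per run of equal keys in B,
    // for runs of 0's as well as runs of 1's.
    std::vector<int> runMax;
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || B[i] != B[i - 1]) runMax.push_back(A[i]);
        else runMax.back() = std::max(runMax.back(), A[i]);
    }

    // Step 2 (adjacent_difference): non-zero wherever B changes value.
    // The first output element is simply B[0].
    std::vector<int> runIdx(n);
    std::adjacent_difference(B.begin(), B.end(), runIdx.begin());

    // Step 3 (transform_inclusive_scan): absolute value, then running sum.
    std::transform(runIdx.begin(), runIdx.end(), runIdx.begin(),
                   [](int d) { return std::abs(d); });
    std::partial_sum(runIdx.begin(), runIdx.end(), runIdx.begin());

    // Step 4 (gather_if): subtracting B[0] makes the run index zero-based;
    // copy the run maximum only where B is 1, leaving 0 elsewhere.
    std::vector<int> C(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        if (B[i] != 0) C[i] = runMax[runIdx[i] - B[0]];
    }
    return C;
}
```

For A = {45,21,764,234,7,0,12,55} and B = {1,1,1,0,1,1,0,0}, step 1 gives {764,234,7,55}, step 2 gives {1,0,0,-1,1,0,-1,0}, step 3 gives {1,1,1,2,3,3,4,4}, and step 4 (after subtracting B[0] = 1) gathers {764,764,764,0,7,7,0,0}.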
Here's a fully worked code demonstrating this, using A and B vectors like your example:
#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/reduce.h>
#include <thrust/gather.h>
#include <thrust/copy.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>
#define DSIZE 8
// Unary functor: absolute value.
template <typename T>
struct abs_val : public thrust::unary_function<T, T>
{
    __host__ __device__
    T operator()(const T& x) const
    {
        if (x<0) return -x;
        else return x;
    }
};
// Unary functor: subtract a fixed value.
template <typename T>
struct subtr : public thrust::unary_function<T, T>
{
    const T val;
    subtr(T _val): val(_val) {}
    __host__ __device__
    T operator()(const T& x) const
    {
        return x-val;
    }
};
int main(){
    int A[DSIZE] = {45,21,764,234,7,0,12,55};
    int B[DSIZE] = {1,1,1,0,1,1,0,0};
    thrust::device_vector<int> dA(A, A+DSIZE);
    thrust::device_vector<int> dB(B, B+DSIZE);
    thrust::device_vector<int> dRed(DSIZE);
    thrust::device_vector<int> diffB(DSIZE);
    thrust::device_vector<int> dRes(DSIZE);  // zero-initialized
    // Step 1: maximum of A over each run of equal values in B.
    thrust::reduce_by_key(dB.begin(), dB.end(), dA.begin(), thrust::make_discard_iterator(), dRed.begin(), thrust::equal_to<int>(), thrust::maximum<int>());
    // Step 2: non-zero wherever B changes value; diffB[0] is B[0].
    thrust::adjacent_difference(dB.begin(), dB.end(), diffB.begin());
    // Step 3: running sum of |diffB| gives a run index, offset by B[0].
    thrust::transform_inclusive_scan(diffB.begin(), diffB.end(), diffB.begin(), abs_val<int>(), thrust::plus<int>());
    // Step 4: where B is 1, fetch the run maximum at index diffB - B[0].
    thrust::gather_if(thrust::make_transform_iterator(diffB.begin(), subtr<int>(B[0])), thrust::make_transform_iterator(diffB.end(), subtr<int>(B[0])), dB.begin(), dRed.begin(), dRes.begin());
    thrust::copy(dRes.begin(), dRes.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
    return 0;
}
Notes about the example:
The above code should produce results matching your desired C output vector, like this:
$ nvcc -arch=sm_20 -o t53 t53.cu
$ ./t53
764 764 764 0 7 7 0 0
$
We can use thrust::placeholders to further simplify the above code, eliminating the need for the extra functor definitions:
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/reduce.h>
#include <thrust/gather.h>
#include <thrust/copy.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>
#define DSIZE 2000000
using namespace thrust::placeholders;
typedef int mytype;
int main(){
    // Random test data: values in A, 0/1 region flags in B.
    mytype *A = (mytype *)malloc(DSIZE*sizeof(mytype));
    int *B = (int *)malloc(DSIZE*sizeof(int));
    for (int i = 0; i < DSIZE; i++){
        A[i] = (rand()/(float)RAND_MAX)*10.0f;
        B[i] = rand()%2;
    }
    thrust::device_vector<mytype> dA(A, A+DSIZE);
    thrust::device_vector<int> dB(B, B+DSIZE);
    thrust::device_vector<mytype> dRed(DSIZE);
    thrust::device_vector<int> diffB(DSIZE);
    thrust::device_vector<mytype> dRes(DSIZE);
    // Time only the four thrust calls, not the host-to-device copies.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // The keys are the int flags in B, so they are compared as int.
    thrust::reduce_by_key(dB.begin(), dB.end(), dA.begin(), thrust::make_discard_iterator(), dRed.begin(), thrust::equal_to<int>(), thrust::maximum<mytype>());
    thrust::adjacent_difference(dB.begin(), dB.end(), diffB.begin());
    // _1*_1 maps -1/0/1 to 1/0/1, replacing the abs_val functor.
    thrust::transform_inclusive_scan(diffB.begin(), diffB.end(), diffB.begin(), _1*_1, thrust::plus<int>());
    // _1 - B[0] replaces the subtr functor.
    thrust::gather_if(thrust::make_transform_iterator(diffB.begin(), _1 - B[0]), thrust::make_transform_iterator(diffB.end(), _1 - B[0]), dB.begin(), dRed.begin(), dRes.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float et;
    cudaEventElapsedTime(&et, start, stop);
    std::cout<< "elapsed time: " << et << "ms " << std::endl;
    // Print only the first 10 results.
    thrust::copy(dRes.begin(), dRes.begin()+10, std::ostream_iterator<mytype>(std::cout, " "));
    std::cout << std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(A);
    free(B);
    return 0;
}
(I've modified the above placeholders code to also include generation of a larger size data set, as well as some basic timing apparatus.)
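On the question's concern that a reduction loses track of the separate regions: one way to see that this kind of problem still fits scan-style primitives is to carry a boundary flag alongside each value, which keeps the combining operator associative. The sketch below is an alternative illustration and is not taken from the thrust code above; `SegMax`, `combine` and `region_max_segmented_scan` are hypothetical names, and it runs sequentially on the host with `std::partial_sum` purely to show the operator. Whether it would outperform the thrust sequence on a GPU is not something this sketch measures.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// One scan element: a running maximum plus a flag saying whether a run
// boundary lies inside the span this element summarizes.
struct SegMax {
    int value;
    bool boundary;
};

// Associative combine for a segmented max scan: if the right operand
// contains a boundary, the left operand's value is discarded.
inline SegMax combine(const SegMax& left, const SegMax& right) {
    return { right.boundary ? right.value : std::max(left.value, right.value),
             left.boundary || right.boundary };
}

std::vector<int> region_max_segmented_scan(const std::vector<int>& A,
                                           const std::vector<int>& B) {
    const std::size_t n = A.size();
    std::vector<SegMax> s(n);

    // Forward pass: flag the first element of every run of equal B values,
    // then scan. Afterwards s[i].value is the max from the run start to i.
    for (std::size_t i = 0; i < n; ++i)
        s[i] = { A[i], i == 0 || B[i] != B[i - 1] };
    std::partial_sum(s.begin(), s.end(), s.begin(), combine);

    // Backward pass: flag the last element of every run and scan from the
    // right, which spreads the full-run maximum over the whole run.
    for (std::size_t i = 0; i < n; ++i)
        s[i].boundary = (i + 1 == n || B[i] != B[i + 1]);
    std::partial_sum(s.rbegin(), s.rend(), s.rbegin(), combine);

    // Keep the result only where B marks a region.
    std::vector<int> C(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        if (B[i] != 0) C[i] = s[i].value;
    return C;
}
```

For A = {45,21,764,234,7,0,12,55} and B = {1,1,1,0,1,1,0,0}, the forward pass leaves the values {45,45,764,234,7,7,12,55}, the backward pass turns them into {764,764,764,234,7,7,55,55}, and masking with B gives {764,764,764,0,7,7,0,0}.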