Get unique elements of multiple arrays in CUDA
Here is the problem: there are a number of arrays, for example 2000, but each array holds only 256 integers. The range of the integers is considerable, for instance [0, 1000000].
I want to get the unique elements of each array, in other words remove the duplicates. I have two solutions in mind:
1. Use Thrust to get the unique elements of every array, which means calling thrust::unique 2000 times. Since each array is pretty small, this may not perform well.
2. Implement a hash table in a CUDA kernel, using 2000 blocks with 256 threads each. The hash table lives in shared memory, so every block produces one duplicate-free array.
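For reference, here is a minimal host-side sketch of what the first option computes, using the standard library instead of Thrust so it runs without a GPU. The function name unique_per_array_loop is hypothetical. Note that std::unique, like thrust::unique, only removes adjacent duplicates, so each array is sorted first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference (hypothetical helper, not part of Thrust):
// deduplicate each fixed-size array independently, one array per iteration.
std::vector<std::vector<std::uint32_t>>
unique_per_array_loop(const std::vector<std::uint32_t>& data, std::size_t per_array)
{
    assert(per_array > 0 && data.size() % per_array == 0);
    std::vector<std::vector<std::uint32_t>> result;
    for (std::size_t begin = 0; begin < data.size(); begin += per_array) {
        // copy one array out of the flat buffer
        std::vector<std::uint32_t> chunk(data.begin() + begin,
                                         data.begin() + begin + per_array);
        // unique only collapses neighbouring duplicates, so sort first
        std::sort(chunk.begin(), chunk.end());
        chunk.erase(std::unique(chunk.begin(), chunk.end()), chunk.end());
        result.push_back(chunk);
    }
    return result;
}
```

On the GPU every loop iteration would become its own sort and unique call on 256 elements, which is where the concern about launching 2000 tiny operations comes from.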
Both approaches seem unprofessional. Is there an elegant way to solve this problem with CUDA?
You can use thrust::unique if you restructure your data along the lines of this SO question: Segmented Sort with CUDPP/Thrust.
For simplicity, let's assume each array contains per_array elements and there are array_num arrays in total. Each element is in the range [0,max_element].
Demo data with per_array=4, array_num=3 and max_element=2 could look like this:
data = {1,0,1,2},{2,2,0,0},{0,0,0,0}
To denote which array each element belongs to, we use the following flags:
flags = {0,0,0,0},{1,1,1,1},{2,2,2,2}
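Because all arrays have the same length, the flag of an element is just its flat index divided by per_array. A small host-side sketch of that (make_flags is a hypothetical helper name):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// flags[i] is the index of the array that the i-th element of the
// flat buffer belongs to (hypothetical helper for illustration).
std::vector<std::uint32_t> make_flags(std::size_t array_num, std::size_t per_array)
{
    assert(per_array > 0);
    std::vector<std::uint32_t> flags(array_num * per_array);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        // integer division maps positions 0..per_array-1 to 0, and so on
        flags[i] = static_cast<std::uint32_t>(i / per_array);
    }
    return flags;
}
```

Calling make_flags(3, 4) reproduces the flags of the demo data.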
To get the unique elements per array of this segmented dataset, we need the following steps:
1. Transform data so that the elements of each array i fall within the distinct range [i*2*max_element,i*2*max_element+max_element]:
data = data + flags*2*max_element
data = {1,0,1,2},{6,6,4,4},{8,8,8,8}
2. Sort the transformed data:
data = {0,1,1,2},{4,4,6,6},{8,8,8,8}
3. Apply thrust::unique_by_key using data as keys and flags as values:
data = {0,1,2},{4,6},{8}
flags = {0,0,0},{1,1},{2}
4. Transform data back to the original values:
data = data - flags*2*max_element
data = {0,1,2},{0,2},{0}
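The four steps can be mirrored on the host to check the logic without a GPU. This is only a sketch under stated assumptions: SegmentedUnique and unique_per_array_offset are hypothetical names, the standard library stands in for the Thrust calls, max_element must be greater than zero, and all shifted values are assumed to fit into 32 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of the segmented unique: values plus the array each one belongs to.
struct SegmentedUnique {
    std::vector<std::uint32_t> data;   // unique values, array after array
    std::vector<std::uint32_t> flags;  // owning array index per value
};

SegmentedUnique unique_per_array_offset(std::vector<std::uint32_t> data,
                                        std::size_t per_array,
                                        std::uint32_t max_element)
{
    assert(per_array > 0 && data.size() % per_array == 0);
    assert(max_element > 0);  // a zero stride would merge all arrays
    const std::uint32_t stride = 2 * max_element;

    // step 1: shift every array into its own value range
    std::vector<std::uint32_t> flags(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        flags[i] = static_cast<std::uint32_t>(i / per_array);
        data[i] += flags[i] * stride;
    }

    // step 2: one global sort; the disjoint ranges keep the arrays apart,
    // so position i still belongs to array flags[i] afterwards
    std::sort(data.begin(), data.end());

    // steps 3 and 4: keep the first element of each run of equal keys
    // together with its flag, then undo the shift
    SegmentedUnique out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i == 0 || data[i] != data[i - 1]) {
            out.data.push_back(data[i] - flags[i] * stride);
            out.flags.push_back(flags[i]);
        }
    }
    return out;
}
```

Feeding it the demo data with per_array=4 and max_element=2 should give the same data and flags as the last step above.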
The maximum value of max_element is bounded by the size of the integer type used to represent data. For an unsigned integer with n bits:
max_max_element(n,array_num) = 2^n/(2*(array_num-1)+1)
Given your array_num=2000, you get the following limits for 32-bit and 64-bit unsigned integers (your range [0, 1000000] stays below the 32-bit limit):
max_max_element(32,2000) = 1074010
max_max_element(64,2000) = 4612839228234446
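The bound comes from requiring that the largest shifted value, (array_num-1)*2*max_element + max_element, is still representable. A small sketch to evaluate it for other settings, written as a host-side function that reuses the name max_max_element from the formula; it divides the largest representable value 2^n - 1 with truncating integer division, so the result is always rounded down.

```cpp
#include <cassert>
#include <cstdint>

// Largest max_element for which (array_num-1)*2*max_element + max_element
// still fits into an unsigned integer with `bits` bits.
// Assumes 2*(array_num-1)+1 itself does not overflow 64 bits.
std::uint64_t max_max_element(unsigned bits, std::uint64_t array_num)
{
    assert(bits >= 1 && bits <= 64 && array_num >= 1);
    // largest value representable with `bits` bits (avoid shifting by 64)
    const std::uint64_t largest =
        (bits == 64) ? UINT64_MAX : ((std::uint64_t{1} << bits) - 1);
    return largest / (2 * (array_num - 1) + 1);
}
```

For example, max_max_element(32, 2000) evaluates the 32-bit limit for 2000 arrays.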
The following code (unique_per_array.cu) implements the above steps:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>   // std::ostream_iterator
#include <cstdint>

#define PRINTER(name) print(#name, (name))

// print the name and the contents of a vector on one line
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
    std::cout << std::endl;
}

int main()
{
    typedef uint32_t Integer;

    const std::size_t per_array = 4;
    const std::size_t array_num = 3;
    const std::size_t total_count = array_num * per_array;

    Integer demo_data[] = {1,0,1,2,2,2,0,0,0,0,0,0};
    thrust::device_vector<Integer> data(demo_data, demo_data+total_count);
    PRINTER(data);

    // if max_element is known for your problem,
    // you don't need the following operation
    Integer max_element = *(thrust::max_element(data.begin(), data.end()));
    std::cout << "max_element=" << max_element << std::endl;

    using namespace thrust::placeholders;

    // create the flags (index of the owning array for every element)
    // could be a smaller integer type as well
    thrust::device_vector<uint32_t> flags(total_count);
    thrust::counting_iterator<uint32_t> flags_cit(0);
    thrust::transform(flags_cit,
                      flags_cit + total_count,
                      flags.begin(),
                      _1 / per_array);
    PRINTER(flags);

    // 1. transform data into unique ranges
    thrust::transform(data.begin(),
                      data.end(),
                      thrust::counting_iterator<Integer>(0),
                      data.begin(),
                      _1 + (_2/per_array)*2*max_element);
    PRINTER(data);

    // 2. sort the transformed data
    thrust::sort(data.begin(), data.end());
    PRINTER(data);

    // 3. eliminate duplicates per array (flags are carried along as values)
    auto new_end = thrust::unique_by_key(data.begin(),
                                         data.end(),
                                         flags.begin());
    uint32_t new_size = new_end.first - data.begin();
    data.resize(new_size);
    flags.resize(new_size);
    PRINTER(data);
    PRINTER(flags);

    // 4. transform data back
    thrust::transform(data.begin(),
                      data.end(),
                      flags.begin(),
                      data.begin(),
                      _1 - _2*2*max_element);
    PRINTER(data);
}
Compiling and running yields:
$ nvcc -std=c++11 unique_per_array.cu -o unique_per_array && ./unique_per_array
data: 1 0 1 2 2 2 0 0 0 0 0 0
max_element=2
flags: 0 0 0 0 1 1 1 1 2 2 2 2
data: 1 0 1 2 6 6 4 4 8 8 8 8
data: 0 1 1 2 4 4 6 6 8 8 8 8
data: 0 1 2 4 6 8
flags: 0 0 0 1 1 2
data: 0 1 2 0 2 0
One more thing:
The development version of Thrust contains an improvement to thrust::unique* that increases performance by around 25 %. You might want to try that version if you aim for better performance.
I think thrust::unique_copy() can help you do this.
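To illustrate the difference from the in-place variant, here is a host-side sketch with std::unique_copy, which follows the same pattern: it writes the deduplicated sequence to a separate output and leaves the input untouched. It also only drops adjacent duplicates, so the input is assumed to be sorted already. The helper name sorted_unique_copy is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Copy a sorted range to a new vector, skipping adjacent duplicates.
// The input is left unmodified (hypothetical helper for illustration).
std::vector<std::uint32_t>
sorted_unique_copy(const std::vector<std::uint32_t>& sorted_input)
{
    assert(std::is_sorted(sorted_input.begin(), sorted_input.end()));
    std::vector<std::uint32_t> output;
    std::unique_copy(sorted_input.begin(), sorted_input.end(),
                     std::back_inserter(output));
    return output;
}
```

Whether this beats the in-place call for 2000 small arrays is not something this sketch can show; it only demonstrates the semantics.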