Get unique elements of multiple arrays in CUDA

Here is the problem: there are a number of arrays, for example 2000, but each array holds only 256 integers. The range of the integers is quite large, [0, 1000000] for instance.

I want to get the unique elements of each array, in other words, remove the duplicate elements. I have two solutions:

  1. Use Thrust to get the unique elements of every array, which means calling thrust::unique 2000 times. But each array is pretty small, so this may not perform well.

  2. Implement a hash table in a CUDA kernel, using 2000 blocks with 256 threads each. The hash table lives in shared memory, so every block produces one array of unique elements.

Both methods seem unprofessional. Is there an elegant way to solve this problem with CUDA?

You can use thrust::unique if you modify your data in a similar way to this SO question: Segmented Sort with CUDPP/Thrust

For simplicity, let's assume each array contains per_array elements and there is a total of array_num arrays. Each element is in the range [0,max_element].

Demo data with per_array=4, array_num=3 and max_element=2 could look like this:

data = {1,0,1,2},{2,2,0,0},{0,0,0,0}

To denote which array each element belongs to, we use the following flags:

flags = {0,0,0,0},{1,1,1,1},{2,2,2,2}

To get the unique elements per array of the segmented dataset, we need to do the following steps:

  1. Transform data so the elements of each array i are within the unique range [i*2*max_element,i*2*max_element+max_element]:

     data = data + flags*2*max_element
     data = {1,0,1,2},{6,6,4,4},{8,8,8,8}
  2. Sort the transformed data. Because the ranges are disjoint and ordered by array index, and every array has exactly per_array elements, the sorted data still lines up with flags:

     data = {0,1,1,2},{4,4,6,6},{8,8,8,8}
  3. Apply thrust::unique_by_key using data as keys and flags as values:

     data = {0,1,2}{4,6}{8}
     flags = {0,0,0}{1,1}{2}
  4. Transform data back to the original values:

     data = data - flags*2*max_element
     data = {0,1,2}{0,2}{0}
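
The four steps can be checked on the host before moving to the GPU. The following is a minimal sketch in plain C++ that uses std::sort and a manual compaction loop in place of the Thrust calls. The helper name segmented_unique_host is hypothetical, and the sketch assumes max_element >= 1 (with max_element == 0 the shift would be 0 and all arrays would collapse into one range) and that the shifted values fit into 32 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference for the four steps (hypothetical helper, not a Thrust API).
// `data` holds all arrays back to back, per_array elements each.
// On return, `data` holds the unique elements grouped by array and
// `flags` holds the array index of every remaining element.
void segmented_unique_host(std::vector<std::uint32_t>& data,
                           std::vector<std::uint32_t>& flags,
                           std::size_t per_array,
                           std::uint32_t max_element)
{
    assert(per_array > 0 && data.size() % per_array == 0);
    assert(max_element > 0);  // assumption: a zero shift would merge all arrays
    const std::uint32_t stride = 2 * max_element;

    // flags[i] = index of the array that element i belongs to
    flags.resize(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        flags[i] = static_cast<std::uint32_t>(i / per_array);

    // 1. shift every array into its own value range
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] += flags[i] * stride;

    // 2. one global sort; flags stay aligned because the ranges are ordered
    std::sort(data.begin(), data.end());

    // 3. drop adjacent duplicates, compacting data and flags together
    std::size_t out = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (out == 0 || data[i] != data[out - 1]) {
            data[out] = data[i];
            flags[out] = flags[i];
            ++out;
        }
    }
    data.resize(out);
    flags.resize(out);

    // 4. undo the shift
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] -= flags[i] * stride;
}
```

Called with the demo data, per_array=4 and max_element=2, this leaves data = {0,1,2,0,2,0} and flags = {0,0,0,1,1,2}, which can serve as a reference when validating the GPU result.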

The maximum value of max_element is bounded by the size of the integer type used for representing data. If it is an unsigned integer with n bits:

max_max_element(n,array_num) = floor((2^n - 1)/(2*(array_num-1)+1))

Given your array_num=2000, you will get the following limits for 32-bit and 64-bit unsigned integers:

max_max_element(32,2000) = 1074010
max_max_element(64,2000) = 4612839228234446
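
The limit follows from requiring that the largest shifted value, (array_num-1)*2*max_element + max_element, still fits into the integer type. A minimal sketch of that calculation in integer arithmetic; the function template below is a hypothetical helper written for illustration, not something Thrust provides:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest max_element for which
//   (array_num-1)*2*max_element + max_element
// is still representable in the unsigned integer type T.
template <typename T>
T max_max_element(std::uint64_t array_num)
{
    static_assert(std::numeric_limits<T>::is_integer &&
                  !std::numeric_limits<T>::is_signed,
                  "T must be an unsigned integer type");
    assert(array_num >= 1);
    // the largest shifted value is max_element * (2*(array_num-1) + 1)
    const std::uint64_t factor = 2 * (array_num - 1) + 1;
    return static_cast<T>(std::numeric_limits<T>::max() / factor);
}
```

For example, max_max_element<std::uint32_t>(2000) evaluates to 1074010. Integer division is used on purpose: a floating-point evaluation of the 64-bit case can round to a value that is one too large.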

The following code implements the above steps:

unique_per_array.cu

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/copy.h>

#include <iostream>
#include <iterator>   // std::ostream_iterator
#include <cstdint>

#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
    std::cout << std::endl;
}

int main()
{ 
    typedef uint32_t Integer;

    const std::size_t per_array = 4;
    const std::size_t array_num = 3;

    const std::size_t total_count = array_num * per_array;

    Integer demo_data[] = {1,0,1,2,2,2,0,0,0,0,0,0};

    thrust::device_vector<Integer> data(demo_data, demo_data+total_count);    

    PRINTER(data);

    // if max_element is known for your problem,
    // you don't need the following operation 
    Integer max_element = *(thrust::max_element(data.begin(), data.end()));
    std::cout << "max_element=" << max_element << std::endl;

    using namespace thrust::placeholders;

    // create the flags

    // could be a smaller integer type as well
    thrust::device_vector<uint32_t> flags(total_count);

    thrust::counting_iterator<uint32_t> flags_cit(0);

    thrust::transform(flags_cit,
                      flags_cit + total_count,
                      flags.begin(),
                      _1 / per_array);
    PRINTER(flags);


    // 1. transform data into unique ranges  
    thrust::transform(data.begin(),
                      data.end(),
                      thrust::counting_iterator<Integer>(0),
                      data.begin(),
                      _1 + (_2/per_array)*2*max_element);
    PRINTER(data);

    // 2. sort the transformed data
    thrust::sort(data.begin(), data.end());
    PRINTER(data);

    // 3. eliminate duplicates per array
    auto new_end = thrust::unique_by_key(data.begin(),
                                         data.end(),
                                         flags.begin());

    uint32_t new_size = new_end.first - data.begin();
    data.resize(new_size);
    flags.resize(new_size);

    PRINTER(data);
    PRINTER(flags);

    // 4. transform data back
    thrust::transform(data.begin(),
                      data.end(),
                      flags.begin(),
                      data.begin(),
                      _1 - _2*2*max_element);

    PRINTER(data);

}    

Compiling and running yields:

$ nvcc -std=c++11 unique_per_array.cu -o unique_per_array && ./unique_per_array

data:   1   0   1   2   2   2   0   0   0   0   0   0   
max_element=2
flags:  0   0   0   0   1   1   1   1   2   2   2   2   
data:   1   0   1   2   6   6   4   4   8   8   8   8   
data:   0   1   1   2   4   4   6   6   8   8   8   8   
data:   0   1   2   4   6   8   
flags:  0   0   0   1   1   2   
data:   0   1   2   0   2   0   
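
The final data is one flat, compacted sequence, so the boundaries between the arrays have to be recovered from flags. A minimal host-side sketch of one way to do that, by counting the elements per flag and taking a prefix sum; segment_offsets is a hypothetical helper name, and the flags are assumed to have been copied back to the host:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns array_num + 1 offsets into the compacted data:
// the unique elements of array i are data[offsets[i]] .. data[offsets[i+1] - 1].
std::vector<std::size_t> segment_offsets(const std::vector<std::uint32_t>& flags,
                                         std::size_t array_num)
{
    std::vector<std::size_t> offsets(array_num + 1, 0);

    // count how many unique elements each array kept
    for (std::uint32_t f : flags) {
        assert(f < array_num);
        ++offsets[f + 1];
    }

    // prefix sum turns the counts into start positions
    for (std::size_t i = 0; i < array_num; ++i)
        offsets[i + 1] += offsets[i];

    return offsets;
}
```

For the demo flags {0,0,0,1,1,2} and array_num=3 this gives the offsets {0,3,5,6}. On the device, the same counts could presumably be obtained with thrust::reduce_by_key over flags followed by thrust::exclusive_scan.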

One more thing:

The Thrust development version includes an improvement to thrust::unique* that raises performance by around 25%. You might want to try that version if you aim for better performance.

I think thrust::unique_copy() can help you with this.
