Determining the least element and its position in each matrix column with CUDA Thrust

I have a fairly simple problem, but I cannot figure out an elegant solution to it.

I have Thrust code that produces c vectors of the same size containing values. Let's say each of these c vectors has an index. For each vector position, I would like to get the index of the c vector whose value is the lowest:

Example:

C0 =     (0,10,20,3,40)
C1 =     (1,2 ,3 ,5,10)

As a result, I would get a vector containing, for each position, the index of the C vector that has the lowest value:

result = (0,1 ,1 ,0,1)

I have thought about doing it with Thrust zip iterators, but have come across issues: I could zip all the c vectors and implement an arbitrary transformation that takes a tuple and returns the index of its lowest value, but:

  1. How do I iterate over the contents of a tuple?
  2. As I understand it, tuples can only store up to 10 elements, and there can be many more than 10 c vectors.

I then thought about doing it this way: instead of having c separate vectors, append them all into a single vector C, then generate keys referencing the positions and perform a stable sort by key, which regroups the vector entries from the same position together. In the example, that would give:

C =      (0,10,20,3,40,1,2,3,5,10)
keys =   (0,1 ,2 ,3,4 ,0,1,2,3,4 )
after stable sort by key:
output = (0,1,10,2,20,3,3,5,40,10)
keys =   (0,0,1 ,1,2 ,2,3,3,4 ,4 )

Then generate keys with the positions in the vector, zip the output with the indices of the c vectors, and perform a reduce by key with a custom functor which, for each reduction, outputs the index with the lowest value. In the example:

input =  (0,1,10,2,20,3,3,5,40,10)
indexes= (0,1,0 ,1,0 ,1,0,1,0 ,1)
keys =   (0,0,1 ,1,2 ,2,3,3,4 ,4)
after reduce by keys on zipped input and indexes:
output = (0,1,1,0,1)

However, how do I write such a functor for the reduce-by-key operation?

One possible idea, building on the vectorized sort idea here:

  1. Suppose I have vectors like this:

     values:   C = ( 0,10,20, 3,40, 1, 2, 3, 5,10)
     keys:     K = ( 0, 1, 2, 3, 4, 0, 1, 2, 3, 4)
     segments: S = ( 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

  2. Zip together K and S to create KS.

  3. stable_sort_by_key using C as the keys, and KS as the values:

     stable_sort_by_key(C.begin(), C.end(), KS_begin);

  4. Zip together the reordered C and K vectors to create CK.

  5. stable_sort_by_key using the reordered S as the keys, and CK as the values:

     stable_sort_by_key(S.begin(), S.end(), CK_begin);

  6. Use a permutation iterator or a strided range iterator to access every Nth element (0, N, 2N, ...) of the newly reordered K vector, to retrieve a vector of the indices of the min element in each segment, where N is the length of the segments.

I haven't actually implemented this; right now it's just an idea. Maybe it won't work for some reason I haven't observed yet.

segments (S) and keys (K) are effectively row and column indices.
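The six steps above can be sketched on the host with std::stable_sort standing in for thrust::stable_sort_by_key. This is only a CPU illustration of the idea, not a tested Thrust implementation; the names Entry and argmin_per_segment are hypothetical and do not come from the answer. As in the steps, it returns the key of the smallest value inside each segment; to get the minimum across vectors at each position instead, the roles of K and S would be swapped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One record carried through both sorts: value C, key K, segment S.
struct Entry {
    float value;   // C
    int key;       // K: position inside the segment
    int segment;   // S: which segment the value belongs to
};

// Hypothetical helper: for each segment, returns the key (position) of its
// smallest value, following steps 1-6 with two stable sorts.
std::vector<int> argmin_per_segment(const std::vector<float>& c,
                                    int num_segments, int segment_length) {
    assert(num_segments > 0 && segment_length > 0);
    const std::size_t n = static_cast<std::size_t>(segment_length);
    assert(c.size() == static_cast<std::size_t>(num_segments) * n);

    // Steps 1-2: build the (C, K, S) records.
    std::vector<Entry> e(c.size());
    for (std::size_t i = 0; i < c.size(); ++i) {
        e[i] = Entry{c[i], static_cast<int>(i % n), static_cast<int>(i / n)};
    }

    // Step 3: stable sort by value.
    std::stable_sort(e.begin(), e.end(),
                     [](const Entry& a, const Entry& b) { return a.value < b.value; });

    // Steps 4-5: stable sort by segment; because the sort is stable, values
    // remain in ascending order inside each segment.
    std::stable_sort(e.begin(), e.end(),
                     [](const Entry& a, const Entry& b) { return a.segment < b.segment; });

    // Step 6: the first record of each segment holds that segment's minimum.
    std::vector<int> result(static_cast<std::size_t>(num_segments));
    for (int s = 0; s < num_segments; ++s) {
        result[static_cast<std::size_t>(s)] = e[static_cast<std::size_t>(s) * n].key;
    }
    return result;
}
```

The second sort only works because it is stable: an unstable sort by segment could scramble the value order established by the first sort.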

And your question seems weird to me, because your title mentions "find index of max value" but most of your question seems to refer to the "lowest value". Regardless, with a change to step 6 of my algorithm, you can find either value.

Since your vectors all have to be the same length, it is better to concatenate them and treat them as a matrix C.

Then your problem becomes finding the indices of the min element of each column in a row-major matrix. It can be solved as follows.

  1. change the row-major order to column-major;
  2. find the index of the min element in each column.

In step 1, you proposed using stable_sort_by_key to rearrange the element order, which is not an efficient method, since the rearrangement can be calculated directly from the #row and #col of the matrix. In Thrust, it can be done with permutation iterators:

thrust::make_permutation_iterator(
    c.begin(),
    thrust::make_transform_iterator(
        thrust::make_counting_iterator((int) 0),
        (_1 % row) * col + _1 / row)
)
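To see what the index expression does, here is a small host-side sketch that applies the same arithmetic with a plain loop. The helper names col_major_to_row_major and gather_col_major are made up for illustration; only the formula itself comes from the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Maps the i-th element of a column-major traversal to its offset in
// row-major storage; same arithmetic as (_1 % row) * col + _1 / row.
inline int col_major_to_row_major(int i, int row, int col) {
    return (i % row) * col + i / row;
}

// Hypothetical helper: gathers a row-major matrix in column-major order.
std::vector<float> gather_col_major(const std::vector<float>& c, int row, int col) {
    assert(row > 0 && col > 0);
    assert(c.size() == static_cast<std::size_t>(row) * static_cast<std::size_t>(col));
    std::vector<float> out(c.size());
    for (int i = 0; i < row * col; ++i) {
        out[static_cast<std::size_t>(i)] =
            c[static_cast<std::size_t>(col_major_to_row_major(i, row, col))];
    }
    return out;
}
```

For the example from the question (row = 2, col = 5), this yields (0,1,10,2,20,3,3,5,40,10), the same ordering the question obtained with a stable sort by key, but without sorting anything.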

In step 2, reduce_by_key can do exactly what you want. In your case the binary reduction functor is easy, since comparison on tuples (the elements of your zipped vector) is already defined lexicographically, so the 1st element of the tuple is compared first, and Thrust supports it as

thrust::minimum< thrust::tuple<float, int> >()
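The same reduction can be mimicked on the host with std::tuple, whose operator< is also lexicographic. The sketch below is a CPU analogue for checking the logic, not the Thrust code path; ValIdx and argmin_per_column are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// (value, row index) pair; tuples compare lexicographically, value first.
using ValIdx = std::tuple<float, int>;

// Hypothetical helper: for each column of a row-major matrix, returns the
// row index of the smallest value by reducing (value, index) tuples with min.
std::vector<int> argmin_per_column(const std::vector<float>& c, int row, int col) {
    assert(row > 0 && col > 0);
    assert(c.size() == static_cast<std::size_t>(row) * static_cast<std::size_t>(col));
    std::vector<int> result(static_cast<std::size_t>(col));
    for (int j = 0; j < col; ++j) {
        ValIdx best{c[static_cast<std::size_t>(j)], 0};
        for (int r = 1; r < row; ++r) {
            const ValIdx cand{c[static_cast<std::size_t>(r) * col + j], r};
            best = std::min(best, cand);  // ties on value fall back to the smaller index
        }
        result[static_cast<std::size_t>(j)] = std::get<1>(best);
    }
    return result;
}
```

Because the index is the second tuple element, equal values resolve to the lowest row index, which makes the result deterministic regardless of reduction order.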

The whole program is shown below. Thrust 1.6.0+ is required since I use placeholders in fancy iterators.

#include <iostream>
#include <iterator>
#include <algorithm>

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

using namespace thrust::placeholders;

int main()
{

    const int row = 2;
    const int col = 5;
    float initc[] =
            { 0, 10, 20, 3, 40, 1, 2, 3, 5, 10 };
    thrust::device_vector<float> c(initc, initc + row * col);

    thrust::device_vector<float> minval(col);
    thrust::device_vector<int> minidx(col);

    thrust::reduce_by_key(
            thrust::make_transform_iterator(
                    thrust::make_counting_iterator((int) 0),
                    _1 / row),
            thrust::make_transform_iterator(
                    thrust::make_counting_iterator((int) 0),
                    _1 / row) + row * col,
            thrust::make_zip_iterator(
                    thrust::make_tuple(
                            thrust::make_permutation_iterator(
                                    c.begin(),
                                    thrust::make_transform_iterator(
                                            thrust::make_counting_iterator((int) 0), (_1 % row) * col + _1 / row)),
                            thrust::make_transform_iterator(
                                    thrust::make_counting_iterator((int) 0), _1 % row))),
            thrust::make_discard_iterator(),
            thrust::make_zip_iterator(
                    thrust::make_tuple(
                            minval.begin(),
                            minidx.begin())),
            thrust::equal_to<int>(),
            thrust::minimum<thrust::tuple<float, int> >()
    );

    std::copy(minidx.begin(), minidx.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
    return 0;
}

Two remaining issues may affect performance.

  1. the min values have to be output, which is not required;
  2. reduce_by_key is designed for segments of varying lengths, so it may not be the fastest algorithm for reduction over segments of equal length.

Writing your own kernel could be the best solution for the highest performance.

I was curious to test which of the previous approaches is faster. So I implemented Robert Crovella's idea in the code below, which, for the sake of completeness, also includes Eric's approach.

#include <cstdio>
#include <iostream>
#include <iterator>
#include <algorithm>

#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

#include "TimingGPU.cuh"

using namespace thrust::placeholders;

template <typename Iterator>
class strided_range
{
    public:

    typedef typename thrust::iterator_difference<Iterator>::type difference_type;

    struct stride_functor : public thrust::unary_function<difference_type,difference_type>
    {
        difference_type stride;

        stride_functor(difference_type stride)
            : stride(stride) {}

        __host__ __device__
        difference_type operator()(const difference_type& i) const
        { 
            return stride * i;
        }
    };

    typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
    typedef typename thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
    typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;

    // type of the strided_range iterator
    typedef PermutationIterator iterator;

    // construct strided_range for the range [first,last)
    strided_range(Iterator first, Iterator last, difference_type stride)
        : first(first), last(last), stride(stride) {}

    iterator begin(void) const
    {
        return PermutationIterator(first, TransformIterator(CountingIterator(0), stride_functor(stride)));
    }

    iterator end(void) const
    {
        return begin() + ((last - first) + (stride - 1)) / stride;
    }

    protected:
    Iterator first;
    Iterator last;
    difference_type stride;
};


/*****************************************************************/
/* CONVERT LINEAR INDEX TO COLUMN INDEX - NEEDED FOR APPROACH #2 */
/*****************************************************************/
template< typename T >
struct mod_functor {
    __host__ __device__ T operator()(T a, T b) { return a % b; }
};

/********/
/* MAIN */
/********/
int main()
{
    /***********************/
    /* SETTING THE PROBLEM */
    /***********************/
    const int Nrows = 200;
    const int Ncols = 200;

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /******************/
    /* APPROACH NR. 1 */
    /******************/
    timerGPU.StartCounter();

    thrust::device_vector<float>    d_min_values(Ncols);
    thrust::device_vector<int>      d_min_indices_1(Ncols);

    thrust::reduce_by_key(
            thrust::make_transform_iterator(
                    thrust::make_counting_iterator((int) 0),
                    _1 / Nrows),
            thrust::make_transform_iterator(
                    thrust::make_counting_iterator((int) 0),
                    _1 / Nrows) + Nrows * Ncols,
            thrust::make_zip_iterator(
                    thrust::make_tuple(
                            thrust::make_permutation_iterator(
                                    d_matrix.begin(),
                                    thrust::make_transform_iterator(
                                            thrust::make_counting_iterator((int) 0), (_1 % Nrows) * Ncols + _1 / Nrows)),
                            thrust::make_transform_iterator(
                                    thrust::make_counting_iterator((int) 0), _1 % Nrows))),
            thrust::make_discard_iterator(),
            thrust::make_zip_iterator(
                    thrust::make_tuple(
                            d_min_values.begin(),
                            d_min_indices_1.begin())),
            thrust::equal_to<int>(),
            thrust::minimum<thrust::tuple<float, int> >()
    );

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    /******************/
    /* APPROACH NR. 2 */
    /******************/
    timerGPU.StartCounter();

    // --- Computing row indices vector
    thrust::device_vector<int> d_row_indices(Nrows * Ncols);
    thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(Nrows * Ncols), thrust::make_constant_iterator(Ncols), d_row_indices.begin(), thrust::divides<int>() );

    // --- Computing column indices vector
    thrust::device_vector<int> d_column_indices(Nrows * Ncols);
    thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(Nrows * Ncols), thrust::make_constant_iterator(Ncols), d_column_indices.begin(), mod_functor<int>());

    // --- int and float iterators
    typedef thrust::device_vector<int>::iterator        IntIterator;
    typedef thrust::device_vector<float>::iterator      FloatIterator;

    // --- Relevant tuples of int and float iterators
    typedef thrust::tuple<IntIterator, IntIterator>     IteratorTuple1;
    typedef thrust::tuple<FloatIterator, IntIterator>   IteratorTuple2;

    // --- zip_iterator of the relevant tuples
    typedef thrust::zip_iterator<IteratorTuple1>        ZipIterator1;
    typedef thrust::zip_iterator<IteratorTuple2>        ZipIterator2;

    // --- zip_iterator creation
    ZipIterator1 iter1(thrust::make_tuple(d_column_indices.begin(), d_row_indices.begin()));

    thrust::stable_sort_by_key(d_matrix.begin(), d_matrix.end(), iter1);

    ZipIterator2 iter2(thrust::make_tuple(d_matrix.begin(), d_row_indices.begin()));

    thrust::stable_sort_by_key(d_column_indices.begin(), d_column_indices.end(), iter2);

    typedef thrust::device_vector<int>::iterator Iterator;

    // --- Strided access to the sorted array
    strided_range<Iterator> d_min_indices_2(d_row_indices.begin(), d_row_indices.end(), Nrows);

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    printf("\n\n");
    std::copy(d_min_indices_2.begin(), d_min_indices_2.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;

    return 0;
}

Testing the two approaches on 2000x2000 matrices gave the following result on a Kepler K20c card:

Eric's             :  8.4s
Robert Crovella's  : 33.4s
