How to do a reduction over one dimension of 2D data in Thrust

I'm new to CUDA and the Thrust library. I'm learning and trying to implement a function that calls a Thrust algorithm inside a for loop. Is there a way to convert this loop into another Thrust call? Or should I use a CUDA kernel to achieve this?

I have come up with code like this:

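#include <cstdio>

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// thrust functor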
struct GreaterthanX
{
    const float _x;
    GreaterthanX(float x) : _x(x) {}

    __host__ __device__ bool operator()(const float &a) const
    {
        return a > _x;
    }
};

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int row = 3;
    int col = 4;
    thrust::device_vector<int> vec(row * col);
    thrust::device_vector<int> count(row);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec
    for (int i = 0; i < row; i++)
    {
        // count over row i only: elements [i * col, (i + 1) * col)
        count[i] = thrust::count_if(vec.begin() + i * col, vec.begin() + (i + 1) * col, GreaterthanX(2));
    }

    thrust::device_vector<int>::iterator result = thrust::max_element(count.begin(), count.end());
    int max_val = *result;
    unsigned int position = result - count.begin();

    printf("result = %d at position %u\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

My goal is to find the row that has the most elements greater than 2. I'm struggling with how to do this without a loop. Any suggestions would be much appreciated. Thanks.
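For reference, the goal can be written as a plain host-side loop, which is handy for checking any GPU version against. This is only a sketch in standard C++: the function names `count_greater_per_row` and `argmax_first` are made up for illustration and are not part of Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: counts, for each row of a row-major
// rows x cols matrix, how many elements are greater than threshold.
std::vector<int> count_greater_per_row(const std::vector<int>& data,
                                       int rows, int cols, int threshold)
{
    assert(data.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<int> counts(rows, 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (data[static_cast<std::size_t>(r) * cols + c] > threshold) {
                ++counts[r];
            }
        }
    }
    return counts;
}

// Hypothetical helper: index of the first maximum, mirroring the
// "first largest element" behaviour of a max_element-style search.
int argmax_first(const std::vector<int>& counts)
{
    assert(!counts.empty());
    int best = 0;
    for (int i = 1; i < static_cast<int>(counts.size()); ++i) {
        if (counts[i] > counts[best]) {
            best = i;
        }
    }
    return best;
}
```

For the 3 x 4 example matrix and a threshold of 2, the per-row counts come out as 3, 1 and 4, so the winning row is index 2 with a count of 4.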

Solution using Thrust

Here is an implementation using thrust::reduce_by_key in conjunction with multiple "fancy iterators".

I also took the liberty of sprinkling in some const, auto and lambdas for elegance and readability. Due to the lambdas, you will need to pass the -extended-lambda flag to nvcc.

thrust::distance is the canonical way of subtracting Thrust iterators.

#include <cassert>
#include <cstdio>

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/extrema.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int const row = 3;
    int const col = 4;
    thrust::device_vector<int> vec(row * col);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;
    thrust::device_vector<int> count(row);

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec

    // counting iterator avoids read from global memory, gives index into vec
    auto keys_in_begin = thrust::make_counting_iterator(0);
    auto keys_in_end = thrust::make_counting_iterator(row * col);
    
    // transform vec on the fly
    auto vals_in_begin = thrust::make_transform_iterator(
        vec.cbegin(), 
        [] __device__ (int val) { return val > 2 ? 1 : 0; });
    
    // discard to avoid write to global memory
    auto keys_out_begin = thrust::make_discard_iterator();
    
    auto vals_out_begin = count.begin();
    
    // transform keys (indices) into row indices and then compare
    // the divisions are one reason one might rather
    // use MatX for higher dimensional data
    auto binary_predicate = [col] __device__ (int i, int j){
        return i / col == j / col;
    };
    
    // this function returns a new end for count 
    // b/c the final number of elements is often not known beforehand
    auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
                                         vals_in_begin,
                                         keys_out_begin,
                                         vals_out_begin,
                                         binary_predicate);
    // make sure that we didn't provide too small of an output vector
    assert(thrust::get<1>(new_ends) == count.end());

    auto const result = thrust::max_element(count.begin(), count.end());
    int const max_val = *result;
    auto const position = thrust::distance(count.begin(), result);

    // position is a signed difference type, so convert it before printing with %d
    std::printf("result = %d at position %d\r\n", max_val, static_cast<int>(position));
    // result = 4 at position 2

    return 0;
}
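To make the role of the key predicate easier to see, here is a simplified host-side model of a keyed, segmented sum in standard C++. It is an illustration only and not how Thrust implements thrust::reduce_by_key; the names `segmented_sum` and `row_counts_via_keys` are hypothetical. The keys are the flat indices 0, 1, 2, ..., and two adjacent keys belong to the same segment when they map to the same row index `i / cols`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: sums runs of values whose adjacent keys
// compare equal under same_segment. Not Thrust's implementation.
template <typename KeyEqual>
std::vector<int> segmented_sum(const std::vector<int>& keys,
                               const std::vector<int>& vals,
                               KeyEqual same_segment)
{
    assert(keys.size() == vals.size());
    std::vector<int> sums;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // Open a new segment at the first element or when adjacent keys differ.
        if (i == 0 || !same_segment(keys[i - 1], keys[i])) {
            sums.push_back(0);
        }
        sums.back() += vals[i];
    }
    return sums;
}

// Builds flat-index keys and 0/1 flags (value > threshold), then reduces
// them row by row using integer division of the key by cols.
std::vector<int> row_counts_via_keys(const std::vector<int>& data,
                                     int rows, int cols, int threshold)
{
    assert(data.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<int> keys(data.size());
    std::vector<int> flags(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        keys[i] = static_cast<int>(i);            // plays the counting iterator
        flags[i] = data[i] > threshold ? 1 : 0;   // plays the transform iterator
    }
    return segmented_sum(keys, flags,
                         [cols](int i, int j) { return i / cols == j / cols; });
}
```

In the Thrust version neither the keys nor the flags are materialized in memory: the counting and transform iterators generate them on the fly, and the output keys are thrown away by the discard iterator.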

Bonus solution using MatX

As mentioned in the comments, NVIDIA has released a new high-level C++17 library called MatX, which targets problems involving (dense) multi-dimensional data (i.e. tensors). The library tries to unify multiple low-level libraries like cuFFT, cuSOLVER and CUTLASS in one Python-/MATLAB-like interface. At the time of writing (v0.2.2) the library is still in initial development and therefore probably doesn't guarantee a stable API. Because of this, and because its performance is not yet as optimized as that of the more mature Thrust library and its documentation and samples are not yet exhaustive, MatX should not be used in production code yet. While constructing this solution I actually stumbled upon a bug, which was fixed immediately. So this code will only work on the main branch and not with the current release v0.2.2, and some of the features used might not appear in the documentation yet.

A solution using MatX looks like this:

#include <iostream>
#include <matx.h>

int main(void)
{
    int const row = 3;
    int const col = 4;
    auto tensor = matx::make_tensor<int, 2>({row, col});
    tensor.SetVals({{3, 2, 4, 5},
                    {0, -2, 3, 1},
                    {9, 8, 7, 6}});
    // tensor.Print(0,0); // print full tensor

    auto count = matx::make_tensor<int, 1>({row});
    // count.Print(0); // print full count

    // Goal: For each row, count the number of elements greater than 2.
    // And then find the row with the max count

    // the kind of reduction is determined through the shapes of tensor and count
    matx::sum(count, matx::as_int(tensor > 2));

    // A single value (scalar) is a tensor of rank 0: 
    auto result_idx = matx::make_tensor<matx::index_t>();
    auto result = matx::make_tensor<int>();
    matx::argmax(result, result_idx, count);

    cudaDeviceSynchronize();
    std::cout << "result = " << result() 
              << " at position " << result_idx() << "\r\n";
    // result = 4 at position 2

    return 0;
}

As MatX employs deferred execution operators, matx::as_int(tensor > 2) is effectively fused into the kernel, achieving the same effect as using a thrust::transform_iterator in Thrust.
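The idea of deferred evaluation can be sketched on the host in standard C++. This is only a conceptual illustration and says nothing about how MatX is implemented internally; `GreaterThanExpr` and `row_sums` are hypothetical names. The expression object stores a pointer and a threshold, and the comparison is computed only when the reduction loop asks for an element, so no intermediate 0/1 array is ever written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lazy expression: nothing is computed at construction,
// each element is evaluated on access.
struct GreaterThanExpr {
    const int* data;   // row-major storage, not owned
    int cols;
    int threshold;

    int operator()(int r, int c) const
    {
        return data[static_cast<std::size_t>(r) * cols + c] > threshold ? 1 : 0;
    }
};

// Row-wise sum over any expression callable as expr(r, c).
template <typename Expr>
std::vector<int> row_sums(int rows, int cols, const Expr& expr)
{
    assert(rows >= 0 && cols >= 0);
    std::vector<int> sums(rows, 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            sums[r] += expr(r, c);  // comparison happens here, inside the reduction
        }
    }
    return sums;
}
```

Calling `row_sums(3, 4, GreaterThanExpr{data.data(), 4, 2})` on the example matrix makes a single pass over the input, which is the same saving the transform iterator provides in the Thrust solution.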

Because MatX knows about the regularity of the problem while Thrust does not, the MatX solution could potentially be more performant than the Thrust solution. It certainly is more elegant. It is also possible to construct tensors in already allocated memory, so one can mix the libraries, e.g. by constructing a tensor in the memory of a thrust::device_vector named vec by passing thrust::raw_pointer_cast(vec.data()) to the constructor of the tensor.
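Wrapping existing memory instead of allocating new storage amounts to a non-owning view. The sketch below shows that pattern in standard C++ with a std::vector standing in for the device vector; `MatrixView` and `make_view` are hypothetical names and do not reflect the MatX tensor constructor, whose exact signature should be checked in the MatX documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view over row-major storage.
// The view never allocates or frees; the owner must outlive it.
struct MatrixView {
    int* data;
    int rows;
    int cols;

    int& operator()(int r, int c) const
    {
        assert(r >= 0 && r < rows && c >= 0 && c < cols);
        return data[static_cast<std::size_t>(r) * cols + c];
    }
};

// Wraps the memory of an existing vector, analogous to handing a raw
// pointer obtained from another library to a tensor type.
inline MatrixView make_view(std::vector<int>& storage, int rows, int cols)
{
    assert(storage.size() == static_cast<std::size_t>(rows) * cols);
    return MatrixView{storage.data(), rows, cols};
}
```

Writes through the view land directly in the original container, so both libraries see the same data without a copy. The usual caveat applies: resizing or destroying the owning container invalidates the view.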
