Reduce matrix columns with CUDA

I have a matrix and I would like to use CUDA to compute its column-wise mean as fast as possible (which boils down to computing the column-wise sum), i.e., return a row vector containing the mean of every column of the matrix. A sum reduction implementation for computing the sum of a single column vector looks like this:

template<typename T>
__global__ void kernelSum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
    extern __shared__ T sdata[];

    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load input into __shared__ memory
    T x = 0.0;
    if (tid < n) {
        x = input[tid];
    }
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}

and this is invoked as:

int n = ... // vector size
const int BLOCK_SIZE = 1024;
int number_of_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
double* per_block_results = NULL;
cudaMalloc((void**) &per_block_results, sizeof(double)*(number_of_blocks + 1));
// launch one kernel to compute, per-block, a partial sum
kernelSum<double> <<<number_of_blocks, BLOCK_SIZE, BLOCK_SIZE*sizeof(double)>>>(a, per_block_results, n);
// launch a single block to compute the sum of the partial sums
kernelSum<double> <<<1, number_of_blocks, number_of_blocks*sizeof(double)>>>(per_block_results, per_block_results + number_of_blocks, number_of_blocks);

I could generalize this kernel to matrices with any number of columns, but I'm limited by the shared memory. My GPU has compute capability 3.5, so it has 48 KB of shared memory and a maximum block size of 1024 threads per block. Since I am interested in double precision, I have at most 48*1024/8 = 6144 doubles of shared memory. Since the reduction is done per block, I can compute the sum reduction for at most 6144 (doubles in shared memory) / 1024 (block size) = 6 columns simultaneously. Reducing the block size would allow more columns to be computed simultaneously, e.g. 6144 (doubles in shared memory) / 512 (block size) = 12.
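The shared-memory arithmetic above can be reproduced with a few lines of host-side C++. This is only a sketch: `max_columns_per_block` is a hypothetical helper name, and the 48 KB figure is the value quoted in the question, not something queried from the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of columns that fit in shared memory when
// every thread of the block needs one element per column.
std::size_t max_columns_per_block(std::size_t shared_bytes,
                                  std::size_t element_bytes,
                                  std::size_t block_size)
{
    assert(element_bytes > 0 && block_size > 0);
    return (shared_bytes / element_bytes) / block_size;
}
```

With the numbers from the question, `max_columns_per_block(48 * 1024, sizeof(double), 1024)` gives 6 and a block size of 512 gives 12, assuming 8-byte doubles.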

Would this more complex strategy beat a simple CPU loop over every column of the matrix that invokes the sum reduction for each one? Is there a better way to do this?

What is stopping you from doing something like this:

template<typename T>
__global__ void kernelSum(const T* __restrict__ input, 
                          T* __restrict__ per_block_results, 
                          const size_t lda, const size_t n)
{
    extern __shared__ T sdata[];

    // Accumulate per thread partial sum
    T x = 0.0;
    const T * p = &input[blockIdx.x * lda];
    for(int i=threadIdx.x; i < n; i += blockDim.x) {
        x += p[i];
    }

    // load partial sum into __shared__ memory
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}

[standard disclaimer: written in browser, never compiled or tested, use at own risk]

i.e. you only need one entry in sdata for each thread in the block for the shared memory reduction. Each thread sums as many values as required to cover the full column input. Then there is no shared memory restriction: you can sum a column of any size with the same block size.
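To see the two phases without a GPU, here is a host-side emulation of what one block does to one column. It is a sketch rather than the kernel itself: the function name `emulate_block_column_sum` is made up for illustration, the "threads" run one after another in ordinary loops, and `block_dim` stands in for `blockDim.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of one thread block summing one column.
// block_dim plays the role of blockDim.x and must be a power of two,
// otherwise the halving loop below would skip some partial sums.
double emulate_block_column_sum(const std::vector<double>& column, std::size_t block_dim)
{
    assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);

    // One shared-memory slot per "thread", regardless of the column length.
    std::vector<double> sdata(block_dim, 0.0);

    // Phase 1: thread t accumulates elements t, t + block_dim, t + 2*block_dim, ...
    for (std::size_t t = 0; t < block_dim; ++t) {
        double x = 0.0;
        for (std::size_t i = t; i < column.size(); i += block_dim) {
            x += column[i];
        }
        sdata[t] = x;
    }

    // Phase 2: tree reduction over the per-thread partial sums.
    for (std::size_t offset = block_dim / 2; offset > 0; offset >>= 1) {
        for (std::size_t t = 0; t < offset; ++t) {
            sdata[t] += sdata[t + offset];
        }
    }
    return sdata[0];
}
```

The shared array has `block_dim` entries whatever the column length, which is why the shared memory limit no longer constrains the column size.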

EDIT: Apparently the idea of using per thread partial sums is new to you, so here is a complete example to study:

#include <iostream>

template<typename T>
__global__
void kernelSum(const T* __restrict__ input, 
        const size_t lda, // pitch of input in words of sizeof(T)
        T* __restrict__ per_block_results, 
                const size_t n)
{
    extern __shared__ T sdata[];

    T x = 0.0;
    const T * p = &input[blockIdx.x * lda];
    // Accumulate per thread partial sum
    for(int i=threadIdx.x; i < n; i += blockDim.x) {
        x += p[i];
    }

    // load thread partial sum into shared memory
    sdata[threadIdx.x] = x;
    __syncthreads();

    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}

int main(void)
{
    const int m = 10000, n = 16;
    float * a = new float[m*n];

    for(int i=0; i<(m*n); i++) { a[i] = (float)(i%10); }

    float *a_;
    size_t size_a = m * n * sizeof(float);
    cudaMalloc((void **)&a_, size_a);
    cudaMemcpy(a_, a, size_a, cudaMemcpyHostToDevice);

    float *b_;
    size_t size_b = n * sizeof(float);
    cudaMalloc((void **)&b_, size_b);

    // select number of warps per block according to size of the
    // column and launch one block per column. Probably makes sense
    // to have at least 4:1 column size to block size
    dim3 blocksize(256); 
    dim3 gridsize(n);
    size_t shmsize = sizeof(float) * (size_t)blocksize.x;
    // argument order follows the kernel signature: input, lda, results, n
    kernelSum<float><<<gridsize, blocksize, shmsize>>>(a_, m, b_, m);

    float * b = new float[n];
    cudaMemcpy(b, b_, size_b, cudaMemcpyDeviceToHost);

    for(int i=0; i<n; i++) {
       std::cout << i << " " << b[i] << std::endl;
    }

    // release device and host buffers
    cudaFree(a_);
    cudaFree(b_);
    delete[] a;
    delete[] b;

    return 0;
}

You should experiment with the block size relative to your matrix size for optimal performance, but in general the more work per thread the kernel does, the better the overall performance will be (because the shared memory reduction is quite expensive). You can see one approach to block and grid size heuristics for a similarly memory-bandwidth-bound problem in this answer.
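As a starting point for that experimentation, the "at least 4:1 column size to block size" remark from the example can be turned into a small host-side helper. This is a hypothetical heuristic, not something taken from the answer: the name `pick_block_size`, the clamp to the range 32 to 1024, and the restriction to powers of two (which the halving reduction loop relies on) are all assumptions of the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical heuristic: largest power-of-two block size that still leaves
// every thread at least min_elems_per_thread elements of the column,
// clamped to [32, 1024] (one warp up to the block-size limit quoted earlier).
std::size_t pick_block_size(std::size_t column_length,
                            std::size_t min_elems_per_thread = 4)
{
    assert(min_elems_per_thread > 0);
    const std::size_t min_block = 32, max_block = 1024;
    std::size_t block = min_block;
    while (block * 2 <= max_block &&
           block * 2 * min_elems_per_thread <= column_length) {
        block *= 2;
    }
    return block;
}
```

The value it returns is only a first guess to benchmark around; the best block size still has to be measured for the matrix shapes you actually use.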

As alternatives to the answer already provided by talonmies, I'm reporting here 4 other approaches for column reduction: 3 of them based on CUDA Thrust and 1 based on cublas<t>gemv() with a column of 1's, as suggested in my comment above.

The CUDA Thrust approaches are analogous to those in Reduce matrix rows with CUDA, with an implicit transposition obtained by

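thrust::make_permutation_iterator(d_matrix.begin(),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                (_1 % Nrows) * Ncols + _1 / Nrows))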
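The index arithmetic in that placeholder expression is easier to follow on the host. The sketch below applies the same formula to a plain `std::vector`; `transposed_index` and `gather_by_columns` are illustrative names and are not part of Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same arithmetic as (_1 % Nrows) * Ncols + _1 / Nrows: maps the i-th element
// of a column-by-column traversal to its offset in the row-major matrix.
std::size_t transposed_index(std::size_t i, std::size_t nrows, std::size_t ncols)
{
    assert(nrows > 0 && i < nrows * ncols);
    return (i % nrows) * ncols + i / nrows;
}

// Reads a row-major matrix in column-by-column order through the index map,
// which is what the permutation iterator does lazily on the device.
std::vector<float> gather_by_columns(const std::vector<float>& row_major,
                                     std::size_t nrows, std::size_t ncols)
{
    assert(row_major.size() == nrows * ncols);
    std::vector<float> out(row_major.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = row_major[transposed_index(i, nrows, ncols)];
    }
    return out;
}
```

For a 2 x 3 matrix stored as 1 2 3 4 5 6, the traversal yields 1 4 2 5 3 6, so each run of `nrows` consecutive elements is one column and can be reduced under a single key.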

Here is the full code:

#include <cublas_v2.h>

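#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>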

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

using namespace thrust::placeholders;

// --- Required for approach #2
__device__ float *vals;

// --- Converts a linear index to a row index (needed for approaches #1 and #3)
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

// --- Sums one column with a serial loop per thread (needed for approach #2)
struct col_reduction {

    const int Nrows;    // --- Number of rows
    const int Ncols;    // --- Number of cols

    col_reduction(int _Nrows, int _Ncols) : Nrows(_Nrows), Ncols(_Ncols) {}

    __device__ float operator()(float& x, int& y ) {
        float temp = 0.f;
        for (int i = 0; i<Nrows; i++) {
            temp += vals[y + (i*Ncols)];
        }
        return temp;
    }
};

// --- Multiplies an index by a constant stride (needed for approach #3)
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) : C(c) { }
    __host__ __device__ T operator()(T x) { return x * C; }
};

/* MAIN */
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /* APPROACH #1 */
    // --- Allocate space for column sums and indices
    thrust::device_vector<float> d_col_sums(Ncols);
    thrust::device_vector<int> d_col_indices(Ncols);

    // --- Compute column sums by summing values with equal column indices
    thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Nrows)),
                          thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Nrows)) + (Nrows*Ncols),
                          thrust::make_permutation_iterator(
                                d_matrix.begin(),
                                thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
                          d_col_indices.begin(),
                          d_col_sums.begin(),
                          thrust::equal_to<int>(),
                          thrust::plus<float>());

    // --- Alternative that discards the output keys (needs <thrust/iterator/discard_iterator.h>)
    // thrust::reduce_by_key(
    //              thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows)),
    //              thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows)) + (Nrows*Ncols),
    //              thrust::make_permutation_iterator(
    //                  d_matrix.begin(),
    //                  thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
    //              thrust::make_discard_iterator(),
    //              d_col_sums.begin());

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    // --- Print result
    for(int j = 0; j < Ncols; j++) {
        std::cout << "[ ";
        for(int i = 0; i < Nrows; i++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_col_sums[j] << "\n";
    }

    /* APPROACH #2 */
    thrust::device_vector<float> d_col_sums_2(Ncols, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_col_sums_2.begin(), d_col_sums_2.end(), thrust::counting_iterator<int>(0), d_col_sums_2.begin(), col_reduction(Nrows, Ncols));

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    for(int j = 0; j < Ncols; j++) {
        std::cout << "[ ";
        for(int i = 0; i < Nrows; i++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_col_sums_2[j] << "\n";
    }

    /* APPROACH #3 */

    thrust::device_vector<float> d_col_sums_3(Ncols, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
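    // --- Segmented inclusive scan over each column, read through the implicit transposition
    thrust::inclusive_scan_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows)) + (Nrows*Ncols),
                thrust::make_permutation_iterator(
                        d_matrix.begin(),
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
                d_temp.begin());
    // --- The last element of each scanned segment is the sum of that column
    thrust::copy(
                thrust::make_permutation_iterator(
                        d_temp.begin() + Nrows - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Nrows))),
                thrust::make_permutation_iterator(
                        d_temp.begin() + Nrows - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Nrows))) + Ncols,
                d_col_sums_3.begin());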

    printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());

    for(int j = 0; j < Ncols; j++) {
        std::cout << "[ ";
        for(int i = 0; i < Nrows; i++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_col_sums_3[j] << "\n";
    }

    /* APPROACH #4 */
    cublasHandle_t handle;
    // --- The handle must be created before any cuBLAS call
    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_col_sums_4(Ncols);
    thrust::device_vector<float> d_ones(Nrows, 1.f);

    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_N, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols, 
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_col_sums_4.data()), 1));

    printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());

    for(int j = 0; j < Ncols; j++) {
        std::cout << "[ ";
        for(int i = 0; i < Nrows; i++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_col_sums_4[j] << "\n";
    }

    return 0;
}
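Approach #4 works because multiplying the transposed matrix by a vector of ones adds up each column. A plain CPU version of that product, sketched below, can also serve as a reference when checking the output of the four GPU approaches; `column_sums_reference` is a hypothetical name, and the row-major layout matches how `d_matrix` is indexed in the printing loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: column sums of a row-major nrows x ncols matrix, written as
// the product of the transposed matrix with a vector of ones.
std::vector<float> column_sums_reference(const std::vector<float>& row_major,
                                         std::size_t nrows, std::size_t ncols)
{
    assert(row_major.size() == nrows * ncols);
    const std::vector<float> ones(nrows, 1.0f);
    std::vector<float> sums(ncols, 0.0f);
    for (std::size_t i = 0; i < nrows; ++i) {
        for (std::size_t j = 0; j < ncols; ++j) {
            // element (i, j) contributes to column j with weight ones[i]
            sums[j] += row_major[i * ncols + j] * ones[i];
        }
    }
    return sums;
}
```

For large matrices of arbitrary floats, compare with a tolerance, since the GPU reductions add the terms in a different order than this loop.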
