
Scaling the rows of a matrix with CUDA

In some computations on the GPU, I need to scale the rows in a matrix so that all the elements in a given row sum to 1.

| a1,1 a1,2 ... a1,N |    | alpha1*a1,1 alpha1*a1,2 ... alpha1*a1,N |
| a2,1 a2,2 ... a2,N | => | alpha2*a2,1 alpha2*a2,2 ... alpha2*a2,N |
| .            .   |    | .                                .    |
| aN,1 aN,2 ... aN,N |    | alphaN*aN,1 alphaN*aN,2 ... alphaN*aN,N |

where

alpha_i = 1.0 / (a_i,1 + a_i,2 + ... + a_i,N)

I need the vector of alpha's and the scaled matrix, and I would like to do this in as few BLAS calls as possible. The code is going to run on NVIDIA CUDA hardware. Does anyone know of a smart way to do this?

CUBLAS 5.0 introduced a BLAS-like routine called cublas(Type)dgmm, which multiplies a matrix by a diagonal matrix (represented by a vector).

There is a left option (which will scale the rows) and a right option (which will scale the columns).

Please refer to the CUBLAS 5.0 documentation for details.

So for your problem, you need to create a vector containing all the alpha's on the GPU and call cublas(Type)dgmm with the left option.
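For illustration, here is a minimal sketch of what such a call could look like, assuming a column-major m x n device matrix d_A and a device vector d_alpha of length m that already holds the alpha's (the names and the wrapper function are hypothetical, not part of the original answer):

#include <cublas_v2.h>

// C = diag(d_alpha) * A scales the rows of a column-major matrix; cuBLAS allows
// dgmm to work in place, so A and C can be the same pointer.
void scaleRows(cublasHandle_t handle, float *d_A, const float *d_alpha, int m, int n)
{
    cublasSdgmm(handle, CUBLAS_SIDE_LEFT,
                m, n,
                d_A, m,        // A and its leading dimension
                d_alpha, 1,    // the vector of alpha's, stride 1
                d_A, m);       // result written back over A
}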

I want to update the answers above with an example considering the use of CUDA Thrust's thrust::transform and of cuBLAS's cublas<t>dgmm. I'm skipping the calculation of the scaling factors alpha's, since this has already been dealt with at

Reduce matrix rows with CUDA

and

Reduce matrix columns with CUDA

Below is a complete example:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/random.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/equal.h>

#include <cublas_v2.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/***********************/
/* RECIPROCAL OPERATOR */
/***********************/
struct Inv: public thrust::unary_function<float, float>
{
    __host__ __device__ float operator()(float x)
    {
        return 1.0f / x;
    }
};

/********/
/* MAIN */
/********/
int main()
{
    /**************************/
    /* SETTING UP THE PROBLEM */
    /**************************/

    const int Nrows = 10;           // --- Number of rows
    const int Ncols =  3;           // --- Number of columns  

    // --- Random uniform integer distribution between 0 and 100
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist1(0, 100);

    // --- Random uniform integer distribution between 1 and 4
    thrust::uniform_int_distribution<int> dist2(1, 4);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist1(rng);

    // --- Normalization vector allocation and initialization
    thrust::device_vector<float> d_normalization(Nrows);
    for (size_t i = 0; i < d_normalization.size(); i++) d_normalization[i] = (float)dist2(rng);

    printf("\n\nOriginal matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    printf("\n\nNormlization vector\n");
    for(int i = 0; i < Nrows; i++) std::cout << d_normalization[i] << "\n";

    TimingGPU timerGPU;

    /*********************************/
    /* ROW NORMALIZATION WITH THRUST */
    /*********************************/

    thrust::device_vector<float> d_matrix2(d_matrix);

    timerGPU.StartCounter();
    thrust::transform(d_matrix2.begin(), d_matrix2.end(),
                      thrust::make_permutation_iterator(
                                d_normalization.begin(),
                                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols))),
                      d_matrix2.begin(),
                      thrust::divides<float>());
    std::cout << "Timing - Thrust = " << timerGPU.GetCounter() << "\n";

    printf("\n\nNormalized matrix - Thrust case\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix2[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    /*********************************/
    /* ROW NORMALIZATION WITH CUBLAS */
    /*********************************/
    d_matrix2 = d_matrix;

    cublasHandle_t handle;
    cublasSafeCall(cublasCreate(&handle));

    timerGPU.StartCounter();
    thrust::transform(d_normalization.begin(), d_normalization.end(), d_normalization.begin(), Inv());
    cublasSafeCall(cublasSdgmm(handle, CUBLAS_SIDE_RIGHT, Ncols, Nrows, thrust::raw_pointer_cast(&d_matrix2[0]), Ncols, 
                   thrust::raw_pointer_cast(&d_normalization[0]), 1, thrust::raw_pointer_cast(&d_matrix2[0]), Ncols));
    std::cout << "Timing - cuBLAS = " << timerGPU.GetCounter() << "\n";

    printf("\n\nNormalized matrix - cuBLAS case\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix2[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    return 0;
}

The Utilities.cu and Utilities.cuh files are maintained here and are omitted. The TimingGPU.cu and TimingGPU.cuh files are maintained here and are omitted as well.

I have tested the above code on a Kepler K20c and these are the results:

Matrix size      Thrust      cuBLAS
2500 x 1250      0.20ms      0.25ms
5000 x 2500      0.77ms      0.83ms

The cuBLAS timing excludes the cublasCreate time. Even so, the CUDA Thrust version seems to be more convenient.

If you use BLAS gemv with a vector of ones, the result will be a vector of the reciprocals of the scaling factors (1/alpha) you need. That is the easy part.
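A minimal sketch of that step with cuBLAS, assuming a column-major M x N device matrix d_A, a device vector d_ones of N ones, and an output vector d_rowsums of length M (all names and the wrapper are hypothetical):

// d_rowsums = 1.0 * A * d_ones + 0.0 * d_rowsums, i.e. the row sums 1/alpha_i
void rowSums(cublasHandle_t handle, const float *d_A, const float *d_ones,
             float *d_rowsums, int M, int N)
{
    const float one = 1.0f, zero = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, M, N, &one, d_A, M, d_ones, 1, &zero, d_rowsums, 1);
}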

Applying the factors row-wise is a bit harder, because standard BLAS doesn't have anything like a Hadamard product operator you could use. Also, because you mention BLAS, I presume you are using column-major storage for your matrices, which is not so straightforward for row-wise operations. The really slow way to do it would be to call BLAS scal on each row with a pitch, but that would require one BLAS call per row, and the pitched memory access would kill performance because of its effect on coalescing and L1 cache coherency.
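For comparison only, that slow per-row approach would look roughly like this with cuBLAS (one cublasSscal call per row, with the leading dimension as the stride; the wrapper and names are hypothetical):

void scaleRowsSlow(cublasHandle_t handle, float *d_A, const float *h_alpha, int M, int N)
{
    for (int i = 0; i < M; i++)
        // Row i of a column-major M x N matrix is the strided vector d_A + i with
        // stride M: exactly the pitched access pattern that ruins coalescing.
        cublasSscal(handle, N, &h_alpha[i], d_A + i, M);
}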

My suggestion would be to use your own kernel for the second operation. It doesn't have to be all that complex, perhaps something like this:

template<typename T>
__global__ void rowscale(T * X, const int M, const int N, const int LDA,
                             const T * ralpha)
{
    for(int row=threadIdx.x; row<M; row+=gridDim.x) {
        const T rscale = 1./ralpha[row]; 
        for(int col=blockIdx.x; col<N; col+=blockDim.x) 
            X[row+col*LDA] *= rscale;
    }
}

That just has the threads in each block stepping through the rows while the blocks step through the columns, scaling as they go. It should work for any sized column-major ordered matrix. Memory access should be coalesced, but depending on how worried about performance you are, there are a number of optimizations you could try. It at least gives a general idea of what to do.
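As a rough sketch of how the kernel might be launched (the block and grid sizes here are illustrative assumptions, not from the original answer; d_X is a column-major M x N device matrix with leading dimension LDA and d_ralpha holds the row sums 1/alpha):

#include <algorithm>

void launchRowscale(float *d_X, int M, int N, int LDA, const float *d_ralpha)
{
    const int threads = 256;                 // threads per block cover the rows
    const int blocks  = std::min(N, 65535);  // blocks cover the columns, strided if N is larger
    rowscale<float><<<blocks, threads>>>(d_X, M, N, LDA, d_ralpha);
    cudaDeviceSynchronize();                 // only to keep the example simple
}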
