Reduce matrix rows with CUDA

Question

Windows 7, NVidia GeForce 425M.

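I wrote a simple CUDA code that calculates the row sums of a matrix. The matrix has a one-dimensional representation (a pointer to float), stored in row-major order.

The serial version of the code is below (it has 2 loops, as expected):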

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

In the CUDA code, I call a kernel function that sweeps the matrix by rows. Below is the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

and the kernel function that performs the parallel sum of the rows (it still has 1 loop):

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

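So far so good. The serial and parallel (CUDA) results are equal.

The whole point is that the CUDA version takes almost twice as long as the serial one, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed for my card).

IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.

Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Times are in msec, averaged over 100 samples:

Matrix: nrow = 90000 x ncol = 1000

Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.

CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.

CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.

Just in case: the versions with 32 and 1024 nThreadsPerBlock are the fastest and the slowest, respectively.

I understand that there is some overhead when copying from host to device and back, but maybe the slowness is because I am not implementing the fastest code.

Since I am far from being a CUDA expert:

Am I coding the fastest version for this task? How could I improve my code? Can I get rid of the loop in the kernel function?

Any thoughts appreciated.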

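Edit 1

Although I describe a standard rowSum, I am actually interested in AND/OR operations over rows with {0, 1} values, like rowAND/rowOR. That means I cannot exploit the cuBLAS trick of multiplying by a column vector of 1's, as suggested by some commenters.

Edit 2

As suggested by other users, and endorsed here:

Forget about trying to write your own functions; use the Thrust library instead and the magic comes.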

Answer 1

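Since you mentioned that you need a general reduction algorithm, not just a sum, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only for sums and has good performance.

Kernel approach

Here is a very good document introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:

1. Multiple thread blocks each reduce one part of the data;
2. One thread block reduces the results of stage 1 to the final single element.

For your multi-reduction problem (reducing the rows of a matrix), stage 1 alone is enough. The idea is to reduce 1 row per thread block. For further considerations, such as multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve performance further, especially for badly shaped matrices.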
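
The "1 row per thread block" idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the linked document or paper: the names kernel_rowReduce, rowReduce, SumOp, AndOp and OrOp are hypothetical, the block size of 256 is an arbitrary power of two, and AND/OR assume the matrix holds 0/1 values stored as float.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical operation types (not from the post): identity() is the neutral
// element, apply() combines two partial results. AndOp/OrOp assume 0/1 inputs.
struct SumOp {
    __device__ static float identity() { return 0.0f; }
    __device__ static float apply(float a, float b) { return a + b; }
};
struct AndOp {
    __device__ static float identity() { return 1.0f; }
    __device__ static float apply(float a, float b) { return (a != 0.0f && b != 0.0f) ? 1.0f : 0.0f; }
};
struct OrOp {
    __device__ static float identity() { return 0.0f; }
    __device__ static float apply(float a, float b) { return (a != 0.0f || b != 0.0f) ? 1.0f : 0.0f; }
};

constexpr int kThreads = 256;  // assumed block size; the tree step needs a power of two

template <typename Op>
__global__ void kernel_rowReduce(const float* m, float* out, int nrow, int ncol) {
    __shared__ float partial[kThreads];
    // Each block handles one row at a time; the stride over rows lets the
    // grid be smaller than nrow (gridDim.x is capped at 65535 on older devices).
    for (int r = blockIdx.x; r < nrow; r += gridDim.x) {
        const float* row = m + (size_t)r * ncol;
        // Neighbouring threads read neighbouring addresses of the same row.
        float acc = Op::identity();
        for (int k = threadIdx.x; k < ncol; k += blockDim.x)
            acc = Op::apply(acc, row[k]);
        partial[threadIdx.x] = acc;
        __syncthreads();
        // Tree reduction of the per-thread partial results in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] = Op::apply(partial[threadIdx.x], partial[threadIdx.x + stride]);
            __syncthreads();
        }
        if (threadIdx.x == 0) out[r] = partial[0];
        __syncthreads();  // partial[] is reused for the next row
    }
}

// Hypothetical host wrapper: copies a row-major matrix in, reduces, copies out.
template <typename Op>
std::vector<float> rowReduce(const std::vector<float>& h_m, int nrow, int ncol) {
    assert(nrow > 0 && ncol > 0 && h_m.size() == (size_t)nrow * ncol);
    float *d_m = nullptr, *d_out = nullptr;
    cudaMalloc(&d_m, h_m.size() * sizeof(float));
    cudaMalloc(&d_out, nrow * sizeof(float));
    cudaMemcpy(d_m, h_m.data(), h_m.size() * sizeof(float), cudaMemcpyHostToDevice);
    int blocks = std::min(nrow, 65535);
    kernel_rowReduce<Op><<<blocks, kThreads>>>(d_m, d_out, nrow, ncol);
    assert(cudaGetLastError() == cudaSuccess);
    std::vector<float> h_out(nrow);
    cudaMemcpy(h_out.data(), d_out, nrow * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_m);
    cudaFree(d_out);
    return h_out;
}
```

Calling rowReduce<OrOp>(m, nrow, ncol) or rowReduce<AndOp>(m, nrow, ncol) gives the rowOR/rowAND variants. For SumOp the additions happen in a different order than in the serial loop, so floating-point results can differ in the last bits.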

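Thrust approach

A general multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.

However, thrust::reduce_by_key does not assume that every row has the same length, so you pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach for row sums. It may give you a basic understanding of the performance.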
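
For the rowAND/rowOR case from Edit 1, a minimal sketch with thrust::reduce_by_key might look like the following. It assumes the matrix holds 0/1 values as int in row-major order, for which thrust::bit_and and thrust::bit_or coincide with logical AND/OR; RowOfIndex and rowReduceByKey are hypothetical names introduced here for illustration.

```cuda
#include <cassert>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical functor: maps a linear element index to its row index,
// which serves as the reduction key.
struct RowOfIndex {
    typedef int result_type;
    int ncol;
    __host__ __device__ explicit RowOfIndex(int n) : ncol(n) {}
    __host__ __device__ int operator()(int i) const { return i / ncol; }
};

// Reduces every row of a row-major nrow x ncol matrix with the binary
// operator op (e.g. thrust::bit_or<int>() for rowOR on 0/1 data).
template <typename BinaryOp>
std::vector<int> rowReduceByKey(const std::vector<int>& h_m, int nrow, int ncol, BinaryOp op) {
    assert(nrow > 0 && ncol > 0 && h_m.size() == (size_t)nrow * ncol);
    thrust::device_vector<int> d_m(h_m.begin(), h_m.end());
    thrust::device_vector<int> d_out(nrow);

    // Keys are generated on the fly: 0,0,...,0,1,1,...,1,2,...
    auto keys = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                RowOfIndex(ncol));
    thrust::reduce_by_key(keys, keys + nrow * ncol,
                          d_m.begin(),
                          thrust::make_discard_iterator(),  // output keys are not needed
                          d_out.begin(),
                          thrust::equal_to<int>(),
                          op);

    std::vector<int> h_out(nrow);
    thrust::copy(d_out.begin(), d_out.end(), h_out.begin());
    return h_out;
}
```

Passing thrust::plus<int>() instead reproduces the plain row sum, so the same helper covers all three operations.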

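cuBLAS approach

The row/column sums of a matrix A can be seen as a matrix-vector multiplication in which the elements of the vector are all ones. This can be expressed by the following MATLAB code.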

y = A * ones(size(A,2),1);

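where y contains the sums of the rows of A.

The cuBLAS library provides a high-performance matrix-vector multiplication function, cublas<t>gemv(), for this operation.

Timing results show that this routine is only 10% to 50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of performance for this operation.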

Answer 2

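Reducing the rows of a matrix can be done with CUDA Thrust in three ways (they may not be the only ones, but addressing that point is out of scope). As the OP also recognized, using CUDA Thrust is preferable for this kind of problem. An approach using cuBLAS is possible as well.

Approach #1 - reduce_by_key

This is the approach suggested in this Thrust example page. The code below includes a variant using make_discard_iterator.

Approach #2 - transform

This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a "key" array.

Approach #3 - inclusive_scan_by_key

This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.

Approach #4 - cublas<t>gemv

It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.

Full code

Here is the code condensing the four approaches. The Utilities.cu and Utilities.cuh files are maintained elsewhere and omitted here. The TimingGPU.cu and TimingGPU.cuh files are also maintained elsewhere and omitted.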

#include <cublas_v2.h>

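#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>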

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

// --- Required for approach #2
__device__ float *vals;

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {

    const int Ncols;    // --- Number of columns

    row_reduction(int _Ncols) : Ncols(_Ncols) {}

    __device__ float operator()(float& x, int& y ) {
        float temp = 0.f;
        for (int i = 0; i<Ncols; i++)
            temp += vals[i + (y*Ncols)];
        return temp;
    }
};

/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) : C(c) { }
    __host__ __device__ T operator()(T x) { return x * C; }
};

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /***************/
    /* APPROACH #1 */
    /***************/
    timerGPU.StartCounter();
    // --- Allocate space for row sums and indices
    thrust::device_vector<float> d_row_sums(Nrows);
    thrust::device_vector<int> d_row_indices(Nrows);

    // --- Compute row sums by summing values with equal row indices
    //thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
    //                    thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
    //                    d_matrix.begin(),
    //                    d_row_indices.begin(),
    //                    d_row_sums.begin(),
    //                    thrust::equal_to<int>(),
    //                    thrust::plus<float>());

    thrust::reduce_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                thrust::make_discard_iterator(),
                d_row_sums.begin());

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    // --- Print result
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums[i] << "\n";
    }

    /***************/
    /* APPROACH #2 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_2(Nrows, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0),  d_row_sums_2.begin(), row_reduction(Ncols));

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_2[i] << "\n";
    }

    /***************/
    /* APPROACH #3 */
    /***************/

    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_3(Nrows, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
    thrust::inclusive_scan_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                d_temp.begin());
    thrust::copy(
                thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
                thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
                d_row_sums_3.begin());

    printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_3[i] << "\n";
    }

    /***************/
    /* APPROACH #4 */
    /***************/
    cublasHandle_t handle;

    timerGPU.StartCounter();
    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_row_sums_4(Nrows);
    thrust::device_vector<float> d_ones(Ncols, 1.f);

    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols, 
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));

    printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_4[i] << "\n";
    }

    cublasSafeCall(cublasDestroy(handle));   // --- Release the cuBLAS handle

    return 0;
}

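Timing results (tested on a Kepler K20c)

Matrix size       #1     #1-v2    #2      #3     #4      #4 (no plan)
100  x 100        0.63   1.00     0.10    0.18   139.4   0.098
1000 x 1000       1.25   1.12     3.25    1.04   101.3   0.12
5000 x 5000       8.38   15.3     16.05   13.8   111.3   1.14
100  x 5000       1.25   1.52     2.92    1.75   101.2   0.40
5000 x 100        1.35   1.99     0.37    1.74   139.2   0.14

It seems that approaches #1 and #3 outperform approach #2, except when the number of columns is small. The best approach, however, is approach #4, which is also more convenient than the others, provided that the time needed to create the plan (in the code above, the cublasCreate call is inside the timed region) can be amortized over the computation.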

Answer 3

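If this (summing the rows) is the full extent of the operations you need to perform on this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you pay the cost of transferring that data element to the GPU. Beyond a certain problem size (whatever it takes to keep the machine busy), you get no added benefit from larger problems, because the arithmetic intensity is O(n).

So this isn't a particularly exciting problem to solve on the GPU.

But, as has already been pointed out, the way you've written it has a coalescing problem, which will slow things down further. Let's look at a small example: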

    C1  C2  C3  C4
R1  11  12  13  14
R2  21  22  23  24
R3  31  32  33  34
R4  41  42  43  44

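The above is a simple pictorial example of a small portion of the matrix. The data is stored such that elements (11), (12), (13), and (14) occupy adjacent memory locations.

For coalesced access, we want an access pattern in which adjacent memory locations are requested by the same instruction, executed across the warp.

We need to think about the execution of your code from the standpoint of a warp, that is, 32 threads executing in lock-step. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code: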

        sum+=m[rowIdx*ncol+k];

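Given the way you created that variable, adjacent threads in the warp have adjacent (i.e. consecutive) values of rowIdx. So when k = 0, which data element is each thread asking for when we retrieve the value m[rowIdx*ncol+k]?

In block 0, thread 0 has a rowIdx of 0, thread 1 has a rowIdx of 1, and so on. So the values requested by each thread at this instruction are: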

Thread:   Memory Location:    Matrix Element:
     0      m[0]                   (11)
     1      m[ncol]                (21)
     2      m[2*ncol]              (31)
     3      m[3*ncol]              (41)

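But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like the "Matrix Element" column to read like this: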

Thread:   Memory Location:    Matrix Element:
     0      m[?]                   (11)
     1      m[?]                   (12)
     2      m[?]                   (13)
     3      m[?]                   (14)

If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:

        sum+=m[k*ncol+rowIdx];

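This gives coalesced access, but it does not give the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by reorganizing your data storage into column-major order instead of row-major order. (You should be able to google that for ideas, right?) Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you is outside the scope of your question, as I see it, and not really a CUDA issue. It may be simple to do while you are creating the matrix on the host or transferring it from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access if the matrix is stored in row-major order. (You could resort to a sequence of row-reduction operations, but that looks painful to me.)

When considering ways to accelerate a code on the GPU, it is not uncommon to reorganize the data storage to suit the GPU. This is one example.

And, yes, what I'm outlining here still retains a loop in the kernel.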
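
The column-major reorganization described above can be sketched as follows. This is an illustration under assumptions, not code from the question: kernel_rowSum_colMajor and rowSumColMajor are hypothetical names, the transpose is done naively on the host, and the default of 128 threads per block is arbitrary. With nrow rows, element (row, col) of the column-major copy sits at index col*nrow + row, so the stride in the kernel is nrow.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// mT holds the matrix in column-major order:
// element (row, col) is stored at mT[col * nrow + row].
__global__ void kernel_rowSum_colMajor(const float* mT, float* s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0.0f;
        // For a fixed k, adjacent threads (adjacent rowIdx) read adjacent
        // addresses, so each warp-wide load is coalesced.
        for (int k = 0; k < ncol; k++)
            sum += mT[(size_t)k * nrow + rowIdx];
        s[rowIdx] = sum;
    }
}

// Hypothetical host helper: takes the row-major matrix from the question,
// builds the column-major copy on the host, and returns the row sums.
std::vector<float> rowSumColMajor(const std::vector<float>& h_m, int nrow, int ncol,
                                  int threads = 128) {
    assert(nrow > 0 && ncol > 0 && h_m.size() == (size_t)nrow * ncol);

    // Reorganize to column-major (equivalent to transposing m).
    std::vector<float> h_mT(h_m.size());
    for (int i = 0; i < nrow; i++)
        for (int j = 0; j < ncol; j++)
            h_mT[(size_t)j * nrow + i] = h_m[(size_t)i * ncol + j];

    float *d_mT = nullptr, *d_s = nullptr;
    cudaMalloc(&d_mT, h_mT.size() * sizeof(float));
    cudaMalloc(&d_s, nrow * sizeof(float));
    cudaMemcpy(d_mT, h_mT.data(), h_mT.size() * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (nrow + threads - 1) / threads;  // integer ceiling division
    kernel_rowSum_colMajor<<<blocks, threads>>>(d_mT, d_s, nrow, ncol);
    assert(cudaGetLastError() == cudaSuccess);

    std::vector<float> h_s(nrow);
    cudaMemcpy(h_s.data(), d_s, nrow * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_mT);
    cudaFree(d_s);
    return h_s;
}
```

For the 4 x 4 example matrix above, rowSumColMajor returns 50, 90, 130 and 170. In a real application the host-side transpose would ideally be avoided by building the matrix in column-major order in the first place.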

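As an additional comment, I would suggest timing the data copy portions and the kernel (compute) portion separately. I can't tell from your question whether you are timing just the kernel or the entire (GPU) operation, including the data copies. If you time the data copy separately, you may discover that the data copy time alone exceeds your CPU time. No amount of effort put into optimizing your CUDA code will affect the data copy time. This might be a useful data point before you spend much time on this.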
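
One possible way to separate the three phases is with CUDA events, as in the sketch below. The kernel is the one from the question, reproduced so the example is self-contained; GpuTimings and timedRowSum are hypothetical names, and error checking is reduced to a single assert for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the question (one thread per row, row-major input).
__global__ void kernel_rowSum(const float* m, float* s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0.0f;
        for (int k = 0; k < ncol; k++)
            sum += m[(size_t)rowIdx * ncol + k];
        s[rowIdx] = sum;
    }
}

// Hypothetical result type: elapsed milliseconds for each phase.
struct GpuTimings {
    float h2d_ms;     // host-to-device copy
    float kernel_ms;  // kernel execution
    float d2h_ms;     // device-to-host copy
};

// Runs the row sum once and reports copy and kernel times separately.
GpuTimings timedRowSum(const std::vector<float>& h_m, std::vector<float>& h_out,
                       int nrow, int ncol, int threads = 128) {
    assert(nrow > 0 && ncol > 0 && h_m.size() == (size_t)nrow * ncol);
    h_out.resize(nrow);

    float *d_m = nullptr, *d_s = nullptr;
    cudaMalloc(&d_m, h_m.size() * sizeof(float));
    cudaMalloc(&d_s, nrow * sizeof(float));

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; i++) cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0]);
    cudaMemcpy(d_m, h_m.data(), h_m.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1]);
    int blocks = (nrow + threads - 1) / threads;
    kernel_rowSum<<<blocks, threads>>>(d_m, d_s, nrow, ncol);
    cudaEventRecord(ev[2]);
    cudaMemcpy(h_out.data(), d_s, nrow * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3]);
    cudaEventSynchronize(ev[3]);  // wait until all recorded work has finished
    assert(cudaGetLastError() == cudaSuccess);

    GpuTimings t;
    cudaEventElapsedTime(&t.h2d_ms, ev[0], ev[1]);
    cudaEventElapsedTime(&t.kernel_ms, ev[1], ev[2]);
    cudaEventElapsedTime(&t.d2h_ms, ev[2], ev[3]);

    for (int i = 0; i < 4; i++) cudaEventDestroy(ev[i]);
    cudaFree(d_m);
    cudaFree(d_s);
    return t;
}
```

Comparing h2d_ms alone against the serial CPU time shows whether any kernel optimization can pay off. A single run is noisy, so averaging over many samples, as the question already does, still applies.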

3 answers

Answer 1: 13, accepted, 2013-07-25 17:15:35
Answer 2: 4, 2015-04-08 21:43:08
Answer 3: 3, 2013-07-25 16:27:27
