
Solving tridiagonal linear systems in CUDA

I am trying to implement a tridiagonal system solver based on the Cyclic Reduction method on my GTS450.

Cyclic Reduction is illustrated in this paper:

Y. Zhang, J. Cohen, J. D. Owens, "Fast Tridiagonal Solvers on the GPU"

However, whatever I do, my CUDA code is far slower than the sequential counterpart. My result for a total of 512 x 512 points is 7 ms, whereas on my i7 3.4GHz it is 5 ms. The GPU is not accelerating!

What could be the problem?

#include "cutrid.cuh"
 __global__ void cutrid_RC_1b(double *a,double *b,double *c,double *d,double *x)
{
 int idx_global=blockIdx.x*blockDim.x+threadIdx.x;
 int idx=threadIdx.x;

 __shared__ double asub[512];
 __shared__ double bsub[512];
 __shared__ double csub[512];
 __shared__ double dsub[512];

 double at=0;
 double bt=0;
 double ct=0;
 double dt=0;

 asub[idx]=a[idx_global];
 bsub[idx]=b[idx_global];
 csub[idx]=c[idx_global];
 dsub[idx]=d[idx_global];


 for(int stride=1;stride<N;stride*=2)
  {
    int margin_left,margin_right;
    margin_left=idx-stride;
    margin_right=idx+stride;


    at=(margin_left>=0)?(-csub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f; 

    bt=bsub[idx]+((margin_left>=0)?(-csub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f)
    -((margin_right<512)?asub[idx+stride]*csub[idx]/bsub[idx+stride]:0.f); 

    ct=(margin_right<512)?(-csub[idx+stride]*asub[idx]/bsub[idx+stride]):0.f; 

    dt=dsub[idx]+((margin_left>=0)?(-dsub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f)
    -((margin_right<512)?dsub[idx+stride]*csub[idx]/bsub[idx+stride]:0.f); 

    __syncthreads();
    asub[idx]=at;
    bsub[idx]=bt;
    csub[idx]=ct;
    dsub[idx]=dt;
    __syncthreads();
  }


x[idx_global]=dsub[idx]/bsub[idx];

}/*}}}*/

I launched this kernel as cutrid_RC_1b<<<512,512>>>(d_a,d_b,d_c,d_d,d_x) and reached 100% device occupancy. This result has puzzled me for days.

Here is an improved version of my code:

    #include "cutrid.cuh"
    __global__ void cutrid_RC_1b(float *a,float *b,float *c,float *d,float *x)
    {/*{{{*/
        int idx_global=blockIdx.x*blockDim.x+threadIdx.x;
        int idx=threadIdx.x;

        __shared__ float asub[512];
        __shared__ float bsub[512];
        __shared__ float csub[512];
        __shared__ float dsub[512];

        asub[idx]=a[idx_global];
        bsub[idx]=b[idx_global];
        csub[idx]=c[idx_global];
        dsub[idx]=d[idx_global];
        __syncthreads();

        //Reduction
        for(int stride=1;stride<512;stride*=2)
        {
            int margin_left=(idx-stride);
            int margin_right=(idx+stride);
            if(margin_left<0) margin_left=0;
            if(margin_right>=512) margin_right=511;
            float tmp1 = asub[idx] / bsub[margin_left];
            float tmp2 = csub[idx] / bsub[margin_right];
            float tmp3 = dsub[margin_right];
            float tmp4 = dsub[margin_left];
            __syncthreads();

            dsub[idx] = dsub[idx] - tmp4*tmp1 - tmp3*tmp2;
            bsub[idx] = bsub[idx] - csub[margin_left]*tmp1 - asub[margin_right]*tmp2;

            tmp3 = -csub[margin_right];
            tmp4 = -asub[margin_left];

            __syncthreads();
            asub[idx] = tmp3*tmp1;
            csub[idx] = tmp4*tmp2;
            __syncthreads();
        }

        x[idx_global]=dsub[idx]/bsub[idx];

    }/*}}}*/

For a 512 x 512 system, the speed improves to 0.73 ms on a Quadro K4000; however, the code in the cited paper runs in 0.5 ms on a GTX280.

Solving a tridiagonal system of equations is a challenging parallel problem, since the classical solution scheme, i.e., Gaussian elimination, is inherently sequential.

Cyclic Reduction consists of two phases:

  1. Forward Reduction. The original system is split into two independent tridiagonal systems for two sets of unknowns: the ones with odd index and the ones with even index. Such systems can be solved independently, and this step can be seen as the first of a divide et impera scheme. The two smaller systems are split again in the same way into two subsystems, and the process is repeated until a system of only 2 equations is reached (the per-equation update is sketched right after this list).
  2. Backward Substitution. The system of 2 equations is solved first. Then, the divide et impera structure is climbed back up by solving the sub-systems independently on different cores.
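
For concreteness, one forward-reduction step with stride $\delta$ updates equation $i$ by eliminating its coupling to equations $i-\delta$ and $i+\delta$. These are the standard Cyclic Reduction formulas; the tmp1/tmp2 arithmetic in the Zhang, Cohen and Owens kernel below computes exactly $k_1$ and $k_2$:

$$k_1 = \frac{a_i}{b_{i-\delta}}, \qquad k_2 = \frac{c_i}{b_{i+\delta}}$$

$$a_i' = -a_{i-\delta}\,k_1, \qquad b_i' = b_i - c_{i-\delta}\,k_1 - a_{i+\delta}\,k_2, \qquad c_i' = -c_{i+\delta}\,k_2, \qquad d_i' = d_i - d_{i-\delta}\,k_1 - d_{i+\delta}\,k_2$$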

I'm not sure (but correct me if I'm wrong) that your code will return consistent results. N does not appear to be defined. Also, you are accessing csub[idx-stride], but I'm not sure what that means when idx==0 and stride>1. Furthermore, you are using several conditional statements, essentially for boundary checking. Finally, your code lacks a proper thread structure capable of dealing with the mentioned divide et impera scheme, conceptually much like the one used in the CUDA SDK reduction samples.

As mentioned in one of my comments above, I remembered that at tridiagonalsolvers you can find an implementation of the Cyclic Reduction scheme for solving tridiagonal equation systems. Browsing the related Google pages, it seems to me that the code is maintained, among others, by the first author of the above paper (Yao Zhang). The code is copied and pasted below. Note that the boundary check is done only once (if (iRight >= systemSize) iRight = systemSize - 1;), thus limiting the number of conditional statements involved. Note also the thread structure, capable of dealing with a divide et impera scheme.

The code by Zhang, Cohen and Owens

__global__ void crKernel(T *d_a, T *d_b, T *d_c, T *d_d, T *d_x)
{
   int thid = threadIdx.x;
   int blid = blockIdx.x;

   int stride = 1;

   int numThreads = blockDim.x;
   const unsigned int systemSize = blockDim.x * 2;

   int iteration = (int)log2(T(systemSize/2));
   #ifdef GPU_PRINTF 
    if (thid == 0 && blid == 0) printf("iteration = %d\n", iteration);
   #endif

   __syncthreads();

   extern __shared__ char shared[];

   T* a = (T*)shared;
   T* b = (T*)&a[systemSize];
   T* c = (T*)&b[systemSize];
   T* d = (T*)&c[systemSize];
   T* x = (T*)&d[systemSize];

   a[thid] = d_a[thid + blid * systemSize];
   a[thid + blockDim.x] = d_a[thid + blockDim.x + blid * systemSize];

   b[thid] = d_b[thid + blid * systemSize];
   b[thid + blockDim.x] = d_b[thid + blockDim.x + blid * systemSize];

   c[thid] = d_c[thid + blid * systemSize];
   c[thid + blockDim.x] = d_c[thid + blockDim.x + blid * systemSize];

   d[thid] = d_d[thid + blid * systemSize];
   d[thid + blockDim.x] = d_d[thid + blockDim.x + blid * systemSize];

   __syncthreads();

   //forward elimination
   for (int j = 0; j <iteration; j++)
   {
       __syncthreads();
       stride *= 2;
       int delta = stride/2;

    if (threadIdx.x < numThreads)
    {
        int i = stride * threadIdx.x + stride - 1;
        int iLeft = i - delta;
        int iRight = i + delta;
        if (iRight >= systemSize) iRight = systemSize - 1;
        T tmp1 = a[i] / b[iLeft];
        T tmp2 = c[i] / b[iRight];
        b[i] = b[i] - c[iLeft] * tmp1 - a[iRight] * tmp2;
        d[i] = d[i] - d[iLeft] * tmp1 - d[iRight] * tmp2;
        a[i] = -a[iLeft] * tmp1;
        c[i] = -c[iRight] * tmp2;
    }
       numThreads /= 2;
   }

   if (thid < 2)
   {
     int addr1 = stride - 1;
     int addr2 = 2 * stride - 1;
     T tmp3 = b[addr2]*b[addr1]-c[addr1]*a[addr2];
     x[addr1] = (b[addr2]*d[addr1]-c[addr1]*d[addr2])/tmp3;
     x[addr2] = (d[addr2]*b[addr1]-d[addr1]*a[addr2])/tmp3;
   }

   // backward substitution
   numThreads = 2;
   for (int j = 0; j <iteration; j++)
   {
       int delta = stride/2;
       __syncthreads();
       if (thid < numThreads)
       {
           int i = stride * thid + stride/2 - 1;
           if(i == delta - 1)
                 x[i] = (d[i] - c[i]*x[i+delta])/b[i];
           else
                 x[i] = (d[i] - a[i]*x[i-delta] - c[i]*x[i+delta])/b[i];
        }
        stride /= 2;
        numThreads *= 2;
     }

   __syncthreads();

   d_x[thid + blid * systemSize] = x[thid];
   d_x[thid + blockDim.x + blid * systemSize] = x[thid + blockDim.x];

}
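
For reference, here is a possible launch of this kernel (my own sketch, not part of the original code; it assumes T has been typedef'd to float and that numSystems independent systems of size systemSize, a power of two, are stored back to back in the device arrays):

    // Hypothetical launch sketch: one block solves one system using systemSize/2
    // threads; the five shared-memory arrays a, b, c, d, x of crKernel together
    // require 5 * systemSize * sizeof(T) bytes of dynamic shared memory.
    typedef float T;
    int numSystems = 512;
    int systemSize = 512;                    // system size, must be a power of two
    size_t smemSize = 5 * systemSize * sizeof(T);
    crKernel<<<numSystems, systemSize / 2, smemSize>>>(d_a, d_b, d_c, d_d, d_x);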

I want to add a further answer to mention that tridiagonal systems can be easily solved in the framework of the cuSPARSE library with the aid of the function

cusparse<t>gtsv()

cuSPARSE also provides

cusparse<t>gtsv_nopivot()

which, at variance with the first mentioned routine, does not perform pivoting. Both of the above functions solve the same linear system with multiple right-hand sides. A batched routine

cusparse<t>gtsvStridedBatch()

also exists, which solves multiple linear systems.

For all the above routines, the system matrix is specified by simply providing the lower diagonal, the main diagonal and the upper diagonal.
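
As an illustration of the batched call, here is a sketch only (the exact signature depends on the CUDA toolkit version, and newer toolkits replace these routines with gtsv2 variants, so check the cuSPARSE documentation of your release):

    // Hypothetical sketch: batchCount independent m x m tridiagonal systems solved
    // at once. The diagonals and right-hand sides of consecutive systems are stored
    // batchStride elements apart (batchStride >= m); solutions overwrite d_x in place.
    cusparseSafeCall(cusparseDgtsvStridedBatch(handle, m, d_ld, d_d, d_ud, d_x,
                                               batchCount, batchStride));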

Below, I'm reporting a fully worked-out example using cusparse<t>gtsv() to solve a tridiagonal linear system.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <assert.h>

#include <cuda_runtime.h>
#include <cusparse_v2.h>

/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) { exit(code); }
   }
}

extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }

/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
    switch (error)
    {

        case CUSPARSE_STATUS_SUCCESS:
            return "CUSPARSE_STATUS_SUCCESS";

        case CUSPARSE_STATUS_NOT_INITIALIZED:
            return "CUSPARSE_STATUS_NOT_INITIALIZED";

        case CUSPARSE_STATUS_ALLOC_FAILED:
            return "CUSPARSE_STATUS_ALLOC_FAILED";

        case CUSPARSE_STATUS_INVALID_VALUE:
            return "CUSPARSE_STATUS_INVALID_VALUE";

        case CUSPARSE_STATUS_ARCH_MISMATCH:
            return "CUSPARSE_STATUS_ARCH_MISMATCH";

        case CUSPARSE_STATUS_MAPPING_ERROR:
            return "CUSPARSE_STATUS_MAPPING_ERROR";

        case CUSPARSE_STATUS_EXECUTION_FAILED:
            return "CUSPARSE_STATUS_EXECUTION_FAILED";

        case CUSPARSE_STATUS_INTERNAL_ERROR:
            return "CUSPARSE_STATUS_INTERNAL_ERROR";

        case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
            return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";

        case CUSPARSE_STATUS_ZERO_PIVOT:
            return "CUSPARSE_STATUS_ZERO_PIVOT";
    }

    return "<unknown>";
}

inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
    if(CUSPARSE_STATUS_SUCCESS != err) {
        fprintf(stderr, "CUSPARSE error in file '%s', line %Ndims\Nobjs %s\nerror %Ndims: %s\nterminating!\Nobjs",__FILE__, __LINE__,err, \
                                _cusparseGetErrorEnum(err)); \
        cudaDeviceReset(); assert(0); \
    }
}

extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }

/********/
/* MAIN */
/********/
int main()
{
    // --- Initialize cuSPARSE
    cusparseHandle_t handle;    cusparseSafeCall(cusparseCreate(&handle));

    const int N     = 5;        // --- Size of the linear system

    // --- Lower diagonal, diagonal and upper diagonal of the system matrix
    double *h_ld = (double*)malloc(N * sizeof(double));
    double *h_d  = (double*)malloc(N * sizeof(double));
    double *h_ud = (double*)malloc(N * sizeof(double));

    h_ld[0]     = 0.;
    h_ud[N-1]   = 0.;
    for (int k = 0; k < N - 1; k++) {
        h_ld[k + 1] = -1.;
        h_ud[k]     = -1.;
    }

    for (int k = 0; k < N; k++) h_d[k] = 2.;

    double *d_ld;   gpuErrchk(cudaMalloc(&d_ld, N * sizeof(double)));
    double *d_d;    gpuErrchk(cudaMalloc(&d_d,  N * sizeof(double)));
    double *d_ud;   gpuErrchk(cudaMalloc(&d_ud, N * sizeof(double)));

    gpuErrchk(cudaMemcpy(d_ld, h_ld, N * sizeof(double), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_d,  h_d,  N * sizeof(double), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_ud, h_ud, N * sizeof(double), cudaMemcpyHostToDevice));

    // --- Allocating and defining dense host and device data vectors
    double *h_x = (double *)malloc(N * sizeof(double)); 
    h_x[0] = 100.0;  h_x[1] = 200.0; h_x[2] = 400.0; h_x[3] = 500.0; h_x[4] = 300.0;

    double *d_x;        gpuErrchk(cudaMalloc(&d_x, N * sizeof(double)));   
    gpuErrchk(cudaMemcpy(d_x, h_x, N * sizeof(double), cudaMemcpyHostToDevice));

    // --- cusparse<t>gtsv() overwrites the right-hand side d_x with the solution
    //     in place, so no separate result vector is needed

    cusparseSafeCall(cusparseDgtsv(handle, N, 1, d_ld, d_d, d_ud, d_x, N));

    gpuErrchk(cudaMemcpy(h_x, d_x, N * sizeof(double), cudaMemcpyDeviceToHost));
    for (int k=0; k<N; k++) printf("%f\n", h_x[k]);
}
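
To sanity-check the GPU result, a minimal host-side Thomas algorithm can be run on the same data (my own sketch, not part of the original answer). For the system above it should print 633.333333, 1166.666667, 1500.000000, 1433.333333, 866.666667, matching the cuSPARSE output:

    #include <stdio.h>
    #include <stdlib.h>

    // Sequential Thomas algorithm: lower diagonal ld, main diagonal d, upper
    // diagonal ud, right-hand side x (overwritten with the solution). No pivoting
    // is performed, as in cusparse<t>gtsv_nopivot().
    void thomas(const double *ld, const double *d, const double *ud, double *x, int n)
    {
        double *cp = (double*)malloc(n * sizeof(double));  // modified upper diagonal
        cp[0] = ud[0] / d[0];
        x[0]  = x[0]  / d[0];
        for (int k = 1; k < n; k++) {                      // forward sweep
            double m = 1. / (d[k] - ld[k] * cp[k - 1]);
            cp[k] = ud[k] * m;
            x[k]  = (x[k] - ld[k] * x[k - 1]) * m;
        }
        for (int k = n - 2; k >= 0; k--)                   // back substitution
            x[k] -= cp[k] * x[k + 1];
        free(cp);
    }

    int main()
    {
        const int N = 5;
        double ld[N] = {   0.,  -1.,  -1.,  -1.,  -1. };
        double d [N] = {   2.,   2.,   2.,   2.,   2. };
        double ud[N] = {  -1.,  -1.,  -1.,  -1.,   0. };
        double x [N] = { 100., 200., 400., 500., 300. };
        thomas(ld, d, ud, x, N);
        for (int k = 0; k < N; k++) printf("%f\n", x[k]);
        return 0;
    }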

At this GitHub repository, a comparison of different CUDA routines available in the cuSOLVER library for the solution of tridiagonal linear systems is reported.

Things I see:

  1. The first __syncthreads() seems redundant.

  2. There are repetitive sets of operations, such as (-csub[idx-stride]*asub[idx]/bsub[idx-stride]), in your code. Use intermediate variables to hold the results and reuse them, instead of making the GPU calculate those sets each time (see the sketch after this list).

  3. Use the NVIDIA profiler to see where the issues are.
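
As a minimal sketch of suggestion 2 applied to the loop body of the first kernel (it keeps the question's arithmetic unchanged, including its use of asub[idx] in the ct expression; k1 and k2 are illustrative names, playing the same role as tmp1/tmp2 in the Zhang, Cohen and Owens kernel):

    // Hoist each repeated quotient into a register once per iteration instead of
    // recomputing it inside every conditional expression.
    double k1 = (margin_left  >= 0 ) ? asub[idx] / bsub[idx - stride] : 0.;
    double k2 = (margin_right < 512) ? csub[idx] / bsub[idx + stride] : 0.;

    at = (margin_left  >= 0 ) ? -csub[idx - stride] * k1 : 0.;
    ct = (margin_right < 512) ? -csub[idx + stride] * asub[idx] / bsub[idx + stride] : 0.;
    bt = bsub[idx] + at - ((margin_right < 512) ? asub[idx + stride] * k2 : 0.);
    dt = dsub[idx] - ((margin_left  >= 0 ) ? dsub[idx - stride] * k1 : 0.)
                   - ((margin_right < 512) ? dsub[idx + stride] * k2 : 0.);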
