Eigenvalue solver in parallel using CUDA

I have been searching and searching the web and I cannot seem to find the answer I'm looking for. I have a particular problem.

I'm editing this in order to simplify the problem and hope it is more readable and understandable.

Let's say I have 5000 20x20 symmetric, dense matrices. I would like to create a kernel in CUDA that will have each thread responsible for calculating the eigenvalues of one of the symmetric matrices.

Sample code of the CUDA kernel would be great if possible.

Any and all help/suggestions would be appreciated!

Thanks,

Johnathan

"I would like to create a kernel in CUDA that will have each thread responsible for calculating the eigenvalues of one of the symmetric matrices."

It's questionable to me whether this would be the fastest approach, but it might be for very small matrices. Even in this situation, there might be some data storage optimizations that could be made (interleaving global data across threads), but this would complicate things.
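
For a rough idea of what such interleaving could look like, here is a sketch (my illustration only; the code below does not do this, and the helper names are hypothetical):

// hypothetical layout sketch: the code below stores each matrix contiguously,
// so element i of matrix idx lives at a[idx*n*n + i]; an interleaved layout
// would put it at a[i*num_matr + idx], so adjacent threads read adjacent
// doubles and global loads can coalesce
__device__ double elem_contiguous (const double *a, int n, int num_matr, int idx, int i){
  return a[idx*n*n + i]; }
__device__ double elem_interleaved(const double *a, int n, int num_matr, int idx, int i){
  return a[i*num_matr + idx]; }

Using the interleaved layout would require modifying the donor routines' indexing throughout, which is exactly the sort of change I avoid below.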

As stated, that request could be mapped into an "embarrassingly parallel" algorithm, where each thread works on a fully independent problem. We need only find suitable single-threaded "donor code". After a quick google search, I came across this jacobi_eigenvalue code (the link appears in the code comments below). It is fairly straightforward to modify that code to run in this thread-independent way. We need only borrow 3 routines (jacobi_eigenvalue, r8mat_diag_get_vector, and r8mat_identity), and suitably decorate these routines with __host__ __device__ for use on the GPU, while making no other changes.

The code in question appears to be GNU LGPL-licensed by J Burkardt at Florida State University. Therefore, with this in view, and following conventional wisdom, I have not included any significant amount of that code in this answer. But you should be able to reconstruct my results experimentally using the instructions I give.

NOTE: I'm not sure what the legal ramifications are of using this code, which claims to be GNU LGPL licensed. You should be sure to adhere to any necessary requirements if you elect to use this code or portions of it. My primary purpose in using it here is to demonstrate the concept of a relatively trivial "embarrassingly parallel" extension of a single-threaded problem solver.

It should be trivial to reconstruct my full code by going to the link in the code comments below and copy-pasting the 3 indicated functions into the places indicated in the remaining code skeleton. But this doesn't change any of the previously mentioned notices/disclaimers. Use it at your own risk.

Again, making no other changes might not be the best idea from a performance standpoint, but it results in a trivial amount of effort and can serve as a possibly useful starting point. Some possible optimizations could be:

  1. seek out a data interleaving strategy so that adjacent threads are more likely to be reading adjacent data
  2. eliminate the in-kernel new and delete operations from the thread code, and replace them with a fixed allocation (this is easy to do; see the sketch after this list)
  3. remove unnecessary code - for example, the code that computes and sorts the eigenvectors, if that data is unneeded
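
As an example of item 2: the donor routine performs a couple of in-kernel new allocations for length-n scratch vectors (with matching delete calls at the end). Since the matrix dimension is a compile-time constant in this setup, a sketch of the modification might look like the following (note this gives up the "no other changes" property, and MAX_DIM is my hypothetical bound):

#define MAX_DIM 20   // assumed upper bound on matrix dimension
__host__ __device__
void jacobi_eigenvalue ( int n, double a[], int it_max, double v[],
  double d[], int &it_num, int &rot_num )
{
  double bw[MAX_DIM];   // was: a new double[n] scratch allocation
  double zw[MAX_DIM];   // was: a new double[n] scratch allocation
  // ... remainder of the routine unchanged, with the trailing
  // delete [] calls on the scratch vectors removed
}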

In any event, with the above decorated donor code, we need only wrap a trivial kernel (je) around it, to launch each thread operating on a separate data set (i.e., matrix), and each thread produces its own set of eigenvalues (and eigenvectors, for this particular code base).

I've crafted it to work with just 3 threads and 3 4x4 matrices for test purposes, but it should be trivial to extend it to however many matrices/threads you wish.
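
For example, scaling up to the 5000 20x20 matrices in the question should only require changing the constants and the host-side fill loop, something like this (your_matrices is a hypothetical stand-in for however you supply the data):

const int num_mat = 5000;  // one thread per matrix
const int N = 20;          // matrix dimension
// ... then, in main(), replace the three initialize_matrix calls with a loop:
for (int m = 0; m < num_mat; m++)
  initialize_matrix(m, N, your_matrices + m*N*N, h_a);  // your_matrices: hypothetical data source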

For brevity of presentation, I've dispensed with the usual error checking, but I recommend you use it, or at least run your code with cuda-memcheck, if you make any modifications.
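
If you want a starting point, a minimal version of that usual error-checking pattern might look like the following (my sketch, not part of the code below):

// minimal CUDA error-checking macro: wraps any call returning cudaError_t
#define cudaCheck(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
      fprintf(stderr, "CUDA error: %s at %s:%d\n", \
              cudaGetErrorString(err), __FILE__, __LINE__); \
      exit(1); } } while (0)
// usage: cudaCheck(cudaMalloc(&d_a, sz));
// after a kernel launch: cudaCheck(cudaGetLastError());
//                        cudaCheck(cudaDeviceSynchronize());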

I've also built the code to adjust the device heap size upward to accommodate the in-kernel new operations, depending on the number of matrices (i.e., threads) and the matrix dimensions. If you worked on the 2nd optimization mentioned above, you could probably remove this.

t1177.cu:

#include <stdio.h>
#include <iostream>
const int num_mat = 3; // total number of matrices = total number of threads
const int N = 4;   // square symmetric matrix dimension
const int nTPB = 256;  // threads per block

// test symmetric matrices

  double a1[N*N] = {
      4.0,  -30.0,    60.0,   -35.0, 
    -30.0,  300.0,  -675.0,   420.0, 
     60.0, -675.0,  1620.0, -1050.0, 
    -35.0,  420.0, -1050.0,   700.0 };

  double a2[N*N] = {
    4.0, 0.0, 0.0, 0.0, 
    0.0, 1.0, 0.0, 0.0, 
    0.0, 0.0, 3.0, 0.0, 
    0.0, 0.0, 0.0, 2.0 };

  double a3[N*N] = {
    -2.0,   1.0,   0.0,   0.0,
     1.0,  -2.0,   1.0,   0.0,
     0.0,   1.0,  -2.0,   1.0,
     0.0,   0.0,   1.0,  -2.0 }; 


/* ---------------------------------------------------------------- */
//
// the following functions come from here:
//
// https://people.sc.fsu.edu/~jburkardt/cpp_src/jacobi_eigenvalue/jacobi_eigenvalue.cpp
//
// attributed to j. burkardt, FSU
// they are unmodified except to add __host__ __device__ decorations
//
//****************************************************************************80
__host__ __device__
void r8mat_diag_get_vector ( int n, double a[], double v[] )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */
//****************************************************************************80
__host__ __device__
void r8mat_identity ( int n, double a[] )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */
//****************************************************************************80
__host__ __device__
void jacobi_eigenvalue ( int n, double a[], int it_max, double v[], 
  double d[], int &it_num, int &rot_num )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */

// end of FSU code
/* ---------------------------------------------------------------- */

// kernel: one thread per matrix; each thread runs the full Jacobi solver
// on its own matrix, eigenvector storage, and eigenvalue storage
__global__ void je(int num_matr, int n, double *a, int it_max, double *v, double *d){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int it_num;
  int rot_num;
  if (idx < num_matr){
    jacobi_eigenvalue(n, a+(idx*n*n), it_max, v+(idx*n*n), d+(idx*n), it_num, rot_num);
  }
}

// copy one NxN source matrix into its slot in the packed host array
void initialize_matrix(int mat_id, int n, double *mat, double *v){

  for (int i = 0; i < n*n; i++) *(v+(mat_id*n*n)+i) = mat[i];
}

void print_vec(int vec_id, int n, double *d){

  std::cout << "matrix " << vec_id << " eigenvalues: " << std::endl;
  for (int i = 0; i < n; i++) std::cout << i << ": " << *(d+(n*vec_id)+i) << std::endl;
  std::cout << std::endl;
}
int main(){
// make sure device heap has enough space for in-kernel new allocations
  const int heapsize = num_mat*N*sizeof(double)*2;
  const int chunks = heapsize/(8192*1024) + 1;
  cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitMallocHeapSize, (8192*1024) * chunks);
  if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "set device heap limit failed!");
    }
  const int max_iter = 1000;
  double *h_a, *d_a, *h_v, *d_v, *h_d, *d_d;
  h_a = (double *)malloc(num_mat*N*N*sizeof(double));
  h_v = (double *)malloc(num_mat*N*N*sizeof(double));
  h_d = (double *)malloc(num_mat*  N*sizeof(double));
  cudaMalloc(&d_a, num_mat*N*N*sizeof(double));
  cudaMalloc(&d_v, num_mat*N*N*sizeof(double));
  cudaMalloc(&d_d, num_mat*  N*sizeof(double));
  memset(h_a, 0, num_mat*N*N*sizeof(double));
  memset(h_v, 0, num_mat*N*N*sizeof(double));
  memset(h_d, 0, num_mat*  N*sizeof(double));
  initialize_matrix(0, N, a1, h_a);
  initialize_matrix(1, N, a2, h_a);
  initialize_matrix(2, N, a3, h_a);
  cudaMemcpy(d_a, h_a, num_mat*N*N*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, h_v, num_mat*N*N*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_d, h_d, num_mat*  N*sizeof(double), cudaMemcpyHostToDevice);
  je<<<(num_mat+nTPB-1)/nTPB, nTPB>>>(num_mat, N, d_a, max_iter, d_v, d_d);
  cudaMemcpy(h_d, d_d, num_mat*N*sizeof(double), cudaMemcpyDeviceToHost);
  print_vec(0, N, h_d);
  print_vec(1, N, h_d);
  print_vec(2, N, h_d);
  return 0;
}

compile and sample run:

$ nvcc -o t1177 t1177.cu
$ cuda-memcheck ./t1177
========= CUDA-MEMCHECK
matrix 0 eigenvalues:
0: 0.166643
1: 1.47805
2: 37.1015
3: 2585.25

matrix 1 eigenvalues:
0: 1
1: 2
2: 3
3: 4

matrix 2 eigenvalues:
0: -3.61803
1: -2.61803
2: -1.38197
3: -0.381966

========= ERROR SUMMARY: 0 errors
$

The output seems plausible to me, mostly matching the reference output for the original code.
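
As a further sanity check, since the borrowed routines are decorated __host__ __device__, the same jacobi_eigenvalue function can also be called on the CPU and compared against a device result. A sketch that could be dropped at the end of main (a2 corresponds to matrix 1; I pass a copy in case the routine modifies the matrix in place):

// host-side spot check: run the same solver on the CPU and compare
// (may need #include <cstring> for memcpy)
double a_copy[N*N], v_ref[N*N], d_ref[N];
int it_num, rot_num;
memcpy(a_copy, a2, N*N*sizeof(double));   // copy in case A is modified
jacobi_eigenvalue(N, a_copy, max_iter, v_ref, d_ref, it_num, rot_num);
for (int i = 0; i < N; i++)
  printf("host: %f   device: %f\n", d_ref[i], h_d[1*N+i]);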
