Eigenvalue solver in parallel using CUDA

I have been searching and searching the web and I cannot seem to find the answer I'm looking for. I have a particular problem.

I'm editing this in order to simplify the problem and hopefully make it more readable and understandable.

Let's say I have 5000 20x20 symmetric, dense matrices. I would like to create a kernel in CUDA that will have each thread responsible for calculating the eigenvalues for each of the symmetric matrices.

Sample code of the CUDA kernel would be great if possible.

Any and all help/suggestions would be appreciated!

Thanks,

Johnathan

I would like to create a kernel in CUDA that will have each thread responsible for calculating the eigenvalues for each of the symmetric matrices.

It's questionable to me whether this would be the fastest approach, but it might be for very small matrices. Even in this situation, there might be some data storage optimizations that could be made (interleaving global data across threads), but this would complicate things.
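To make the interleaving idea concrete, here is a minimal sketch (not part of the code below; the function names are hypothetical) of the two possible indexing schemes. With per-matrix-contiguous storage, adjacent threads read addresses that are n*n elements apart; with an interleaved layout, adjacent threads read adjacent addresses, which coalesces better:

// hypothetical sketch: two ways to index element i of matrix idx
__device__ double elem_contiguous(const double *a, int n, int idx, int i){
  return a[(size_t)idx*n*n + i];        // layout used below: strided across threads in a warp
}
__device__ double elem_interleaved(const double *a, int num_matr, int idx, int i){
  return a[(size_t)i*num_matr + idx];   // interleaved: adjacent threads touch adjacent addresses
}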

As stated, that request can be mapped onto an "embarrassingly parallel" algorithm, where each thread works on a fully independent problem. We need only find suitable single-threaded "donor code". After a quick google search, I came across this code (the jacobi_eigenvalue source linked in the comments of the listing below). It is fairly straightforward to modify that code to run in this thread-independent way. We need only borrow 3 routines (jacobi_eigenvalue, r8mat_diag_get_vector, and r8mat_identity), and suitably decorate these routines with __host__ __device__ for use on the GPU, while making no other changes.

The code in question appears to be GNU LGPL licensed by J. Burkardt at Florida State University. Therefore, with this in view, and following conventional wisdom, I have not included any significant amount of that code in this answer. But you should be able to reconstruct my results by following the instructions I give.

NOTE: I'm not sure what the legal ramifications are of using this code, which claims to be GNU LGPL licensed. You should be sure to adhere to any necessary requirements if you elect to use this code or portions of it. My primary purpose in using it here is to demonstrate the concept of a relatively trivial "embarrassingly parallel" extension of a single-threaded problem solver.

It should be trivial to reconstruct my full code by going to the linked FSU source and copy-pasting the 3 indicated functions into the places indicated in the remaining code skeleton. But this doesn't change any of the previously mentioned notices/disclaimers. Use it at your own risk.

Again, making no other changes might not be the best idea from a performance standpoint, but it results in a trivial amount of effort and can serve as a possibly useful starting point. Some possible optimizations could be:

  1. seek out a data interleaving strategy so that adjacent threads are more likely to be reading adjacent data
  2. eliminate the new and delete functions from the thread code, and replace them with a fixed allocation (this is easy to do; see the sketch after this list)
  3. remove unnecessary code - for example that which computes and sorts the eigenvectors, if that data is unneeded
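For item 2, one possible approach (a sketch only, assuming you are willing to modify the donor routine and can accept a compile-time upper bound on the matrix dimension) is to replace the two double[n] work arrays the routine allocates with new by fixed-size local arrays:

// hypothetical sketch of optimization 2; requires editing the borrowed routine
#define MAX_DIM 32            // assumed compile-time upper bound on n

__host__ __device__
void jacobi_eigenvalue_fixed ( int n, double a[], int it_max, double v[],
  double d[], int &it_num, int &rot_num )
{
  double bw[MAX_DIM];         // was: bw = new double[n];
  double zw[MAX_DIM];         // was: zw = new double[n];
  // ... remainder of the donor routine unchanged, with the delete [] calls removed ...
}

With this change the in-kernel heap allocations disappear, and the device heap sizing discussed below becomes unnecessary.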

In any event, with the above decorated donor code, we need only wrap a trivial kernel (je) around it, to launch each thread operating on a separate data set (i.e. matrix), and each thread produces its own set of eigenvalues (and eigenvectors, for this particular code base).

I've crafted it to work with just 3 threads and 3 4x4 matrices for test purposes, but it should be trivial to extend it to however many matrices/threads you wish.
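For example, for the 5000 20x20 case in the question, only the two constants at the top of the file would need to change (you would also need to populate h_a with 5000 matrices rather than the 3 test matrices):

const int num_mat = 5000; // total number of matrices = total number of threads
const int N = 20;         // square symmetric matrix dimension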

For brevity of presentation, I've dispensed with the usual error checking, but I recommend you use it, or at least run your code with cuda-memcheck if you make any modifications.
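For reference, a typical form of such error checking (a sketch; the macro name is arbitrary) looks like the following, invoked after each CUDA API call and kernel launch (together with a cudaDeviceSynchronize() if you also want to catch asynchronous errors):

// sketch of a CUDA error-checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

// usage example: cudaCheckErrors("kernel launch failed");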

I've also built the code to adjust the device heap size upward to accommodate the in-kernel new operations, depending on the number of matrices (i.e. threads) and the matrix dimensions. If you implemented the 2nd optimization mentioned above, you could probably remove this.

t1177.cu:

#include <stdio.h>
#include <iostream>
const int num_mat = 3; // total number of matrices = total number of threads
const int N = 4;   // square symmetric matrix dimension
const int nTPB = 256;  // threads per block

// test symmetric matrices

  double a1[N*N] = {
      4.0,  -30.0,    60.0,   -35.0, 
    -30.0,  300.0,  -675.0,   420.0, 
     60.0, -675.0,  1620.0, -1050.0, 
    -35.0,  420.0, -1050.0,   700.0 };

  double a2[N*N] = {
    4.0, 0.0, 0.0, 0.0, 
    0.0, 1.0, 0.0, 0.0, 
    0.0, 0.0, 3.0, 0.0, 
    0.0, 0.0, 0.0, 2.0 };

  double a3[N*N] = {
    -2.0,   1.0,   0.0,   0.0,
     1.0,  -2.0,   1.0,   0.0,
     0.0,   1.0,  -2.0,   1.0,
     0.0,   0.0,   1.0,  -2.0 }; 


/* ---------------------------------------------------------------- */
//
// the following functions come from here:
//
// https://people.sc.fsu.edu/~jburkardt/cpp_src/jacobi_eigenvalue/jacobi_eigenvalue.cpp
//
// attributed to j. burkardt, FSU
// they are unmodified except to add __host__ __device__ decorations
//
//****************************************************************************80
__host__ __device__
void r8mat_diag_get_vector ( int n, double a[], double v[] )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */
//****************************************************************************80
__host__ __device__
void r8mat_identity ( int n, double a[] )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */
//****************************************************************************80
__host__ __device__
void jacobi_eigenvalue ( int n, double a[], int it_max, double v[], 
  double d[], int &it_num, int &rot_num )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */

// end of FSU code
/* ---------------------------------------------------------------- */

// one thread per matrix: each thread runs the Jacobi routine on its own matrix
__global__ void je(int num_matr, int n, double *a, int it_max, double *v, double *d){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int it_num;
  int rot_num;
  if (idx < num_matr){
    jacobi_eigenvalue(n, a+(idx*n*n), it_max, v+(idx*n*n), d+(idx*n), it_num, rot_num);
  }
}

// copy one NxN test matrix into its slot in the packed host array
void initialize_matrix(int mat_id, int n, double *mat, double *v){

  for (int i = 0; i < n*n; i++) *(v+(mat_id*n*n)+i) = mat[i];
}

void print_vec(int vec_id, int n, double *d){

  std::cout << "matrix " << vec_id << " eigenvalues: " << std::endl;
  for (int i = 0; i < n; i++) std::cout << i << ": " << *(d+(n*vec_id)+i) << std::endl;
  std::cout << std::endl;
}
int main(){
// make sure device heap has enough space for in-kernel new allocations
// (the donor routine allocates two double[n] work arrays per call with new)
  const int heapsize = num_mat*N*sizeof(double)*2;
  const int chunks = heapsize/(8192*1024) + 1;
  cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitMallocHeapSize, (8192*1024) * chunks);
  if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "set device heap limit failed!");
    }
  const int max_iter = 1000;
  double *h_a, *d_a, *h_v, *d_v, *h_d, *d_d;
  h_a = (double *)malloc(num_mat*N*N*sizeof(double));
  h_v = (double *)malloc(num_mat*N*N*sizeof(double));
  h_d = (double *)malloc(num_mat*  N*sizeof(double));
  cudaMalloc(&d_a, num_mat*N*N*sizeof(double));
  cudaMalloc(&d_v, num_mat*N*N*sizeof(double));
  cudaMalloc(&d_d, num_mat*  N*sizeof(double));
  memset(h_a, 0, num_mat*N*N*sizeof(double));
  memset(h_v, 0, num_mat*N*N*sizeof(double));
  memset(h_d, 0, num_mat*  N*sizeof(double));
  initialize_matrix(0, N, a1, h_a);
  initialize_matrix(1, N, a2, h_a);
  initialize_matrix(2, N, a3, h_a);
  cudaMemcpy(d_a, h_a, num_mat*N*N*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, h_v, num_mat*N*N*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_d, h_d, num_mat*  N*sizeof(double), cudaMemcpyHostToDevice);
  je<<<(num_mat+nTPB-1)/nTPB, nTPB>>>(num_mat, N, d_a, max_iter, d_v, d_d);
  cudaMemcpy(h_d, d_d, num_mat*N*sizeof(double), cudaMemcpyDeviceToHost);
  print_vec(0, N, h_d);
  print_vec(1, N, h_d);
  print_vec(2, N, h_d);
  return 0;
}

compile and sample run:

$ nvcc -o t1177 t1177.cu
$ cuda-memcheck ./t1177
========= CUDA-MEMCHECK
matrix 0 eigenvalues:
0: 0.166643
1: 1.47805
2: 37.1015
3: 2585.25

matrix 1 eigenvalues:
0: 1
1: 2
2: 3
3: 4

matrix 2 eigenvalues:
0: -3.61803
1: -2.61803
2: -1.38197
3: -0.381966

========= ERROR SUMMARY: 0 errors
$

The output seems plausible to me, mostly matching published reference results for these test matrices.
