
Matrix multiplication in cuSparse (cusparseDcsrgemm) outputs wrong results

I am trying to compute A^T A using cuSparse, where A is a large but sparse matrix. The problem is that when I use the function cusparseDcsrgemm, the computed output is wrong. Please see the minimal example below to reproduce the problem.

CMakeLists.txt

cmake_minimum_required(VERSION 3.11)

project(sample)

find_package(CUDA REQUIRED)

add_executable(${PROJECT_NAME} main.cpp)

target_compile_features(${PROJECT_NAME} PUBLIC cxx_std_14)

target_include_directories(${PROJECT_NAME} SYSTEM PUBLIC ${CUDA_INCLUDE_DIRS})

target_link_libraries(${PROJECT_NAME} ${CUDA_LIBRARIES} ${CUDA_cusparse_LIBRARY})

main.cpp

#include <iostream>
#include <vector>

#include <cuda_runtime_api.h>
#include <cusparse_v2.h>

int main(){
  // 3x3 identity matrix in CSR format
  std::vector<int> row;
  std::vector<int> col;
  std::vector<double> val;

  row.emplace_back(0);
  row.emplace_back(1);
  row.emplace_back(2);
  row.emplace_back(3);

  col.emplace_back(0);
  col.emplace_back(1);
  col.emplace_back(2);

  val.emplace_back(1);
  val.emplace_back(1);
  val.emplace_back(1);

  int *d_row;
  int *d_col;
  double *d_val;

  int *d_out_row;
  int *d_out_col;
  double *d_out_val;

  cudaMalloc(reinterpret_cast<void **>(&d_row), row.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_col), col.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_val), val.size() * sizeof(double));

  // we know identity transpose times identity is still identity 
  cudaMalloc(reinterpret_cast<void **>(&d_out_row), row.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_out_col), col.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_out_val), val.size() * sizeof(double));

  cudaMemcpy(
      d_row, row.data(), sizeof(int) * row.size(), cudaMemcpyHostToDevice);
  cudaMemcpy(
      d_col, col.data(), sizeof(int) * col.size(), cudaMemcpyHostToDevice);
  cudaMemcpy(
      d_val, val.data(), sizeof(double) * val.size(), cudaMemcpyHostToDevice);

  cusparseHandle_t handle;
  cusparseCreate(&handle);

  cusparseMatDescr_t descr;
  cusparseCreateMatDescr(&descr);
  cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

  cusparseMatDescr_t descr_out;
  cusparseCreateMatDescr(&descr_out);
  cusparseSetMatType(descr_out, CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase(descr_out, CUSPARSE_INDEX_BASE_ZERO);

  cusparseDcsrgemm(handle,
                   CUSPARSE_OPERATION_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                   3,
                   3,
                   3,
                   descr,
                   3,
                   d_val,
                   d_row,
                   d_col,
                   descr,
                   3,
                   d_val,
                   d_row,
                   d_col,
                   descr_out,
                   d_out_val,
                   d_out_row,
                   d_out_col);

  cudaMemcpy(
      row.data(), d_out_row, sizeof(int) * row.size(), cudaMemcpyDeviceToHost);
  cudaMemcpy(
      col.data(), d_out_col, sizeof(int) * col.size(), cudaMemcpyDeviceToHost);
  cudaMemcpy(
      val.data(), d_out_val, sizeof(double) * val.size(), cudaMemcpyDeviceToHost);

  std::cout << "row" << std::endl;
  for (int i : row)
  {
    std::cout << i << std::endl; //show 0 0 0 0, but it should be 0 1 2 3
  }

  std::cout << "col" << std::endl;
  for (int i : col)
  {
    std::cout << i << std::endl; //show 1 0 0, but it should be 0 1 2
  }

  std::cout << "val" << std::endl;
  for (double i : val) // val holds doubles; iterating as int would truncate
  {
    std::cout << i << std::endl; //show 1 0 0, but it should be 1 1 1
  }

  return 0;
}

What am I doing wrong?

You simply forgot one step because you tried to make an easy example. The documentation states:

The cuSPARSE library adopts a two-step approach to complete sparse matrix. In the first step, the user allocates csrRowPtrC of m+1 elements and uses the function cusparseXcsrgemmNnz() to determine csrRowPtrC and the total number of nonzero elements.

What you did was allocate m+1 elements (m=3 in your example) for d_out_row, and you determined the total number of nonzero elements, which is 3 in your example. But you skipped the part that "determines" d_out_row, that is, filling the array with the right values. In your simple example you could just add the line

cudaMemcpy(d_out_row, row.data(), sizeof(int) * row.size(), cudaMemcpyHostToDevice);

somewhere before your gemm call.

The more general approach, of course, is to use the suggested function cusparseXcsrgemmNnz(). You could add the following lines somewhere before your gemm call (many values are still hardcoded as in your example, so it is not really general):

int nnz_check[1];
cusparseXcsrgemmNnz(handle,
                    CUSPARSE_OPERATION_TRANSPOSE,
                    CUSPARSE_OPERATION_NON_TRANSPOSE,
                    3,
                    3,
                    3,
                    descr,
                    3,
                    d_row,
                    d_col,
                    descr,
                    3,
                    d_row,
                    d_col,
                    descr_out,
                    d_out_row,  // the values this pointer points to will be set
                    nnz_check); // the number of nonzeros will also be calculated
assert(nnz_check[0] == 3); // requires #include <cassert>

Side note: The documentation says "[[DEPRECATED]] use cusparse<t>csrgemm2() instead. The routine will be removed in the next major release", that is, version 11. The same problem applies to the second gemm version, though, since it uses the same two-step approach.
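
To see why the row pointer of the result has to be computed before the values can be written, it may help to look at the two-step pattern on the CPU. The following is only a minimal host-side sketch, not cuSPARSE code: the Csr struct and the functions transpose, gemmNnz and gemm are hypothetical names made up for this illustration, and the column indices within each output row are not guaranteed to be sorted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CSR container (zero-based indices), for illustration only.
struct Csr {
  int rows = 0;
  int cols = 0;
  std::vector<int> rowPtr;  // rows + 1 entries
  std::vector<int> colInd;  // nnz entries
  std::vector<double> val;  // nnz entries
};

// Build the transpose as a new CSR matrix (counting sort on column index).
Csr transpose(const Csr &a) {
  Csr t;
  t.rows = a.cols;
  t.cols = a.rows;
  t.rowPtr.assign(a.cols + 1, 0);
  t.colInd.resize(a.colInd.size());
  t.val.resize(a.val.size());
  for (int c : a.colInd) ++t.rowPtr[c + 1];
  for (int i = 0; i < a.cols; ++i) t.rowPtr[i + 1] += t.rowPtr[i];
  std::vector<int> next(t.rowPtr.begin(), t.rowPtr.end() - 1);
  for (int r = 0; r < a.rows; ++r) {
    for (int k = a.rowPtr[r]; k < a.rowPtr[r + 1]; ++k) {
      int dst = next[a.colInd[k]]++;
      t.colInd[dst] = r;
      t.val[dst] = a.val[k];
    }
  }
  return t;
}

// Step 1: fill rowPtrC (rows + 1 entries) for C = A * B and return nnz(C).
int gemmNnz(const Csr &a, const Csr &b, std::vector<int> &rowPtrC) {
  assert(a.cols == b.rows);
  rowPtrC.assign(a.rows + 1, 0);
  std::vector<int> marker(b.cols, -1);  // last row in which a column was seen
  for (int i = 0; i < a.rows; ++i) {
    int count = 0;
    for (int ka = a.rowPtr[i]; ka < a.rowPtr[i + 1]; ++ka) {
      int j = a.colInd[ka];
      for (int kb = b.rowPtr[j]; kb < b.rowPtr[j + 1]; ++kb) {
        int c = b.colInd[kb];
        if (marker[c] != i) { marker[c] = i; ++count; }
      }
    }
    rowPtrC[i + 1] = rowPtrC[i] + count;
  }
  return rowPtrC[a.rows];
}

// Step 2: allocate nnz entries and fill colInd/val using rowPtr from step 1.
Csr gemm(const Csr &a, const Csr &b) {
  Csr c;
  c.rows = a.rows;
  c.cols = b.cols;
  int nnz = gemmNnz(a, b, c.rowPtr);
  c.colInd.resize(nnz);
  c.val.resize(nnz);
  std::vector<int> pos(b.cols, -1);  // slot of each column in the current row
  for (int i = 0; i < a.rows; ++i) {
    int rowStart = c.rowPtr[i];
    int next = rowStart;
    for (int ka = a.rowPtr[i]; ka < a.rowPtr[i + 1]; ++ka) {
      int j = a.colInd[ka];
      for (int kb = b.rowPtr[j]; kb < b.rowPtr[j + 1]; ++kb) {
        int col = b.colInd[kb];
        double prod = a.val[ka] * b.val[kb];
        if (pos[col] < rowStart) {  // first contribution to this column
          pos[col] = next;
          c.colInd[next] = col;
          c.val[next] = prod;
          ++next;
        } else {
          c.val[pos[col]] += prod;
        }
      }
    }
  }
  return c;
}
```

Calling gemm(transpose(a), a) with the 3x3 identity from the question goes through both steps; the second step can only place entries because the first step has already filled rowPtr, which is the part that was missing in the original program.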
