Matrix multiplication in cuSparse (cusparseDcsrgemm) outputs wrong results
I am trying to compute A^T A using cuSparse. A is a large but sparse matrix. The problem is that when I use the function cusparseDcsrgemm, the computed output is wrong. Please see the minimal example below to reproduce the problem.
CMakeLists.txt
cmake_minimum_required(VERSION 3.11)
project(sample)
find_package(CUDA REQUIRED)
add_executable(${PROJECT_NAME} main.cpp)
target_compile_features(${PROJECT_NAME} PUBLIC cxx_std_14)
target_include_directories(${PROJECT_NAME} SYSTEM PUBLIC ${CUDA_INCLUDE_DIRS})
target_link_libraries(${PROJECT_NAME} ${CUDA_LIBRARIES} ${CUDA_cusparse_LIBRARY})
main.cpp
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>
#include <cusparse_v2.h>
int main(){
// 3x3 identity matrix in CSR format
std::vector<int> row;
std::vector<int> col;
std::vector<double> val;
row.emplace_back(0);
row.emplace_back(1);
row.emplace_back(2);
row.emplace_back(3);
col.emplace_back(0);
col.emplace_back(1);
col.emplace_back(2);
val.emplace_back(1);
val.emplace_back(1);
val.emplace_back(1);
int *d_row;
int *d_col;
double *d_val;
int *d_out_row;
int *d_out_col;
double *d_out_val;
cudaMalloc(reinterpret_cast<void **>(&d_row), row.size() * sizeof(int));
cudaMalloc(reinterpret_cast<void **>(&d_col), col.size() * sizeof(int));
cudaMalloc(reinterpret_cast<void **>(&d_val), val.size() * sizeof(double));
// we know identity transpose times identity is still identity
cudaMalloc(reinterpret_cast<void **>(&d_out_row), row.size() * sizeof(int));
cudaMalloc(reinterpret_cast<void **>(&d_out_col), col.size() * sizeof(int));
cudaMalloc(reinterpret_cast<void **>(&d_out_val), val.size() * sizeof(double));
cudaMemcpy(
d_row, row.data(), sizeof(int) * row.size(), cudaMemcpyHostToDevice);
cudaMemcpy(
d_col, col.data(), sizeof(int) * col.size(), cudaMemcpyHostToDevice);
cudaMemcpy(
d_val, val.data(), sizeof(double) * val.size(), cudaMemcpyHostToDevice);
cusparseHandle_t handle;
cusparseCreate(&handle);
cusparseMatDescr_t descr;
cusparseCreateMatDescr(&descr);
cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);
cusparseMatDescr_t descr_out;
cusparseCreateMatDescr(&descr_out);
cusparseSetMatType(descr_out, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr_out, CUSPARSE_INDEX_BASE_ZERO);
cusparseDcsrgemm(handle,
CUSPARSE_OPERATION_TRANSPOSE,
CUSPARSE_OPERATION_NON_TRANSPOSE,
3,
3,
3,
descr,
3,
d_val,
d_row,
d_col,
descr,
3,
d_val,
d_row,
d_col,
descr_out,
d_out_val,
d_out_row,
d_out_col);
cudaMemcpy(
row.data(), d_out_row, sizeof(int) * row.size(), cudaMemcpyDeviceToHost);
cudaMemcpy(
col.data(), d_out_col, sizeof(int) * col.size(), cudaMemcpyDeviceToHost);
cudaMemcpy(
val.data(), d_out_val, sizeof(double) * val.size(), cudaMemcpyDeviceToHost);
std::cout << "row" << std::endl;
for (int i : row)
{
std::cout << i << std::endl; //show 0 0 0 0, but it should be 0 1 2 3
}
std::cout << "col" << std::endl;
for (int i : col)
{
std::cout << i << std::endl; //show 1 0 0, but it should be 0 1 2
}
std::cout << "val" << std::endl;
for (double i : val) // val holds doubles; iterating as int would truncate them
{
std::cout << i << std::endl; //show 1 0 0, but it should be 1 1 1
}
return 0;
}
What am I doing wrong?
You simply forgot one step because you tried to make an easy example. The documentation states:
The cuSPARSE library adopts a two-step approach to complete sparse matrix. In the first step, the user allocates csrRowPtrC of m+1 elements and uses the function cusparseXcsrgemmNnz() to determine csrRowPtrC and the total number of nonzero elements.
What you did was allocate m+1 (m=3 in your example) elements for d_out_row, and you determined the total number of nonzero elements, which is 3 in your example. But you skipped the part that says "determine csrRowPtrC" (d_out_row in your code), which means filling the array with the right values. In your simple example you could just add the line
cudaMemcpy(d_out_row, row.data(), sizeof(int) * row.size(), cudaMemcpyHostToDevice);
somewhere before your gemm call.
The more general approach, of course, is to use the suggested function cusparseXcsrgemmNnz(). You could add the following lines somewhere before your gemm call (many values are still hardcoded as in your example, so it's not really general; the assert also needs #include <cassert> at the top of the file):
int nnz_check[1];
cusparseXcsrgemmNnz(handle,
CUSPARSE_OPERATION_TRANSPOSE,
CUSPARSE_OPERATION_NON_TRANSPOSE,
3,
3,
3,
descr,
3,
d_row,
d_col,
descr,
3,
d_row,
d_col,
descr_out,
d_out_row, // the values this pointer points to will be set
nnz_check); // the number of nonzeros will also be calculated
assert(nnz_check[0] == 3);
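To make it clearer what "determine csrRowPtrC" means, here is a minimal host-only sketch of that counting step for C = A^T * A. It is only an illustration of the idea, not how cuSPARSE implements cusparseXcsrgemmNnz(); the helper name symbolic_row_ptr_AtA and its parameters are made up for this example, and it assumes zero-based CSR input.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper (not part of cuSPARSE): host-side analogue of the
// counting step for C = A^T * A, with A given in zero-based CSR format.
// Returns the row pointer array of C with a_cols + 1 entries; its last
// entry is the total number of structural nonzeros of C.
std::vector<int> symbolic_row_ptr_AtA(int a_rows, int a_cols,
                                      const std::vector<int>& row_ptr,
                                      const std::vector<int>& col_ind) {
    assert(static_cast<int>(row_ptr.size()) == a_rows + 1);

    // pattern[i] collects every column j for which C(i, j) can be nonzero.
    // C(i, j) is structurally nonzero if some row r of A stores both
    // A(r, i) and A(r, j).
    std::vector<std::set<int>> pattern(a_cols);
    for (int r = 0; r < a_rows; ++r) {
        for (int p = row_ptr[r]; p < row_ptr[r + 1]; ++p) {
            for (int q = row_ptr[r]; q < row_ptr[r + 1]; ++q) {
                pattern[col_ind[p]].insert(col_ind[q]);
            }
        }
    }

    // Prefix sum over the per-row counts gives the CSR row pointer of C.
    std::vector<int> row_ptr_c(a_cols + 1, 0);
    for (int i = 0; i < a_cols; ++i) {
        row_ptr_c[i + 1] = row_ptr_c[i] + static_cast<int>(pattern[i].size());
    }
    return row_ptr_c;
}
```

For the 3x3 identity from the question (row pointer 0 1 2 3, column indices 0 1 2), this returns 0 1 2 3, which is the row array the question expected to see. The last entry, 3, is the nonzero count that tells you how large d_out_col and d_out_val have to be.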
Side note: The documentation says "[[DEPRECATED]] use cusparse<t>csrgemm2() instead. The routine will be removed in the next major release", that is, version 11. The problem still remains for the second gemm version, though, as the same two-step approach is used.