
How to change sub-matrix of a sparse matrix on CUDA device

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7). I also anticipate that the solver will need to be used many times, and that a portion of this matrix will need to be updated several times (between computing solutions) as well.

Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.

What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.

The matrix data structure would reside on the CUDA device in arrays: d_col, d_row, and d_val.

On the system side I would have corresponding arrays I, J, and val.

So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.

Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.

Naively, I would think that to implement this I would have an integer array or vector on the host side, e.g. updateInds, that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.

In essence: how do I change the entries in a CUDA device side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?

As long as you only want to change the numerical values of the value array associated with a CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.

Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):

checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);

Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:

for (int i = 10; i < 25; i++)
  val[i] = 4.0f;

The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:

cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);

Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
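When the changed entries are scattered, one way to keep the number of transfers down is to first coalesce a sorted index list (like the updateInds array from the question) into maximal contiguous runs on the host, then issue one cudaMemcpy per run. This helper is not from the answer above, just a sketch; the run-finding logic itself is plain C++ and assumes the index list is sorted ascending:

```cpp
#include <utility>
#include <vector>

// Group a sorted list of changed indices into maximal contiguous runs.
// Each returned (start, length) pair could then be moved with a single
// cudaMemcpy:
//   cudaMemcpy(d_val + start, val + start, length * sizeof(float),
//              cudaMemcpyHostToDevice);
std::vector<std::pair<int, int>> coalesceRuns(const std::vector<int> &inds)
{
    std::vector<std::pair<int, int>> runs;
    for (std::size_t i = 0; i < inds.size(); ++i) {
        if (runs.empty() ||
            inds[i] != runs.back().first + runs.back().second)
            runs.push_back({inds[i], 1});  // index breaks the run: start anew
        else
            ++runs.back().second;          // index is adjacent: extend the run
    }
    return runs;
}
```

For example, indices {10, 11, 12, 20, 21, 30} coalesce into the runs (10, 3), (20, 2), (30, 1), so three cudaMemcpy calls suffice instead of six.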

If I have several disjoint regions similar to the above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:

for (int i = 10; i < 15; i++)
  val[i] = 1.0f;
for (int i = 20; i < 25; i++)
  val[i] = 2.0f;
for (int i = 30; i < 35; i++)
  val[i] = 4.0f;

then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
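For the three equal-length, equally spaced regions above, the single strided transfer might look like the following sketch (assuming the loop bounds above, i.e. 5-element runs starting every 10 elements). The "pitch" is the byte distance between region starts and the "width" is the bytes actually copied per region; this fragment requires a CUDA-capable device to run:

```cuda
// Three regions of 5 floats starting at indices 10, 20, 30:
//   pitch  (bytes between region starts) = 10 * sizeof(float)
//   width  (bytes copied per region)     =  5 * sizeof(float)
//   height (number of regions)           =  3
cudaMemcpy2D(d_val + 10, 10 * sizeof(float),  // dst, dst pitch
             val   + 10, 10 * sizeof(float),  // src, src pitch
             5 * sizeof(float), 3,            // width in bytes, height
             cudaMemcpyHostToDevice);
```

Because width is smaller than pitch, the 5 unchanged elements between runs are simply skipped on both the host and device sides.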

Notes:

  1. cudaMemcpy2D is slower than you might expect, compared to a cudaMemcpy operation on the same number of elements.
  2. CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may actually still be quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation.
  3. The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.
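For the fully scattered updateInds case in the question, one further option (not part of the answer above, just a sketch) is to copy the index list and the gathered new values to the device once, then scatter them into d_val with a small kernel. This only pays off when 2*n elements transferred is well below the full array size; error checking is omitted and a GPU is required:

```cuda
#include <vector>

// Scatter kernel: d_val[d_inds[i]] = d_newvals[i] for i in [0, n)
__global__ void scatterUpdate(float *d_val, const int *d_inds,
                              const float *d_newvals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_val[d_inds[i]] = d_newvals[i];
}

// Host side: gather the changed values, copy two small arrays, launch.
// updateInds holds the n changed indices into val (names from the question).
void updateDeviceValues(float *d_val, const int *updateInds,
                        const float *val, int n)
{
    // Gather the changed values into a compact host buffer.
    std::vector<float> h_newvals(n);
    for (int i = 0; i < n; ++i)
        h_newvals[i] = val[updateInds[i]];

    int *d_inds;
    float *d_newvals;
    cudaMalloc(&d_inds, n * sizeof(int));
    cudaMalloc(&d_newvals, n * sizeof(float));
    cudaMemcpy(d_inds, updateInds, n * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_newvals, h_newvals.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);

    scatterUpdate<<<(n + 255) / 256, 256>>>(d_val, d_inds, d_newvals, n);

    cudaFree(d_inds);
    cudaFree(d_newvals);
}
```

Per note 2 above, the two extra cudaMalloc/cudaMemcpy calls and the kernel launch each add overhead, so for dense update patterns a single whole-array cudaMemcpy may still win.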

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

© 2020-2024 STACKOOM.COM