
Matrix operations in R: parallelization, sparse operations, GPU computation

The basic aim of my question is how to achieve the best performance of matrix operations in R using the Matrix package. In particular, I want to parallelize operations (multiplication) and work with sparse matrices using computation on a CUDA GPU.

Details

According to the documentation of the Matrix package on CRAN:

A rich hierarchy of matrix classes, including triangular, symmetric, and diagonal matrices, both dense and sparse and with pattern, logical and numeric entries. Numerous methods for and operations on these matrices, using 'LAPACK' and 'SuiteSparse' libraries.

It seems that, thanks to SuiteSparse, I should be able to perform basic operations on sparse matrices using the GPU (CUDA). In particular, the SuiteSparse documentation lists the following:

SSMULT and SFMULT: sparse matrix multiplication.

On my Gentoo system I have installed suitesparse-4.2.1 along with suitesparseconfig-4.2.1-r1. I also have lapack, scalapack and blas. The R sessionInfo() looks as follows:

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Gentoo/Linux

Matrix products: default
BLAS: /usr/lib64/blas/reference/libblas.so.0.0.0
LAPACK: /usr/lib64/lapack/reference/liblapack.so.0.0.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Matrix_1.2-10

loaded via a namespace (and not attached):
[1] compiler_3.4.1  grid_3.4.1      lattice_0.20-35

I have also set the environment variable:

export CHOLMOD_USE_GPU=1

which I found on a forum and which is supposed to enable GPU usage.

Basically, everything looks ready to go. However, when I run a simple test:

library(Matrix)
M1 <- rsparsematrix(10000, 10000, 0.01)  # random 10000 x 10000 sparse matrix, 1% density
M <- M1 %*% t(M1)                        # sparse-sparse product

the GPU does not seem to be used, as if R ignored the SuiteSparse features.
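As a quick sanity check (not from the original post; a sketch using base-R helpers, where La_library() requires R >= 3.4), one can ask the running session which BLAS/LAPACK shared objects it is actually linked against:

```r
# Report the BLAS/LAPACK shared objects this session is linked against;
# a "reference" path means the unoptimized, single-threaded implementation.
si <- sessionInfo()
print(si$BLAS)
print(si$LAPACK)
print(La_library())  # LAPACK path as reported by base R (R >= 3.4)
```

In the session shown above these point at the reference BLAS/LAPACK, which is single-threaded.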

I know the questions are quite broad, but:

  • Does anyone know whether R has to be compiled in a specific way to work with suitesparse?
  • How can I make sure that the Matrix package uses all the shared libraries for parallelization and sparse operations (including the GPU)?
  • Can anyone confirm that they were able to run matrix operations on a CUDA GPU using the Matrix package?

As far as I have searched Stack Overflow and other forums, this question should not be a duplicate.

  1. It is not as easy as you describe. The Matrix package contains a subset of SuiteSparse, and this subset is built into the package. So Matrix does not use your system SuiteSparse (you can easily browse the Matrix source code here).
  2. sparse_matrix * sparse_matrix multiplication is hard to parallelize efficiently - the best strategy varies a lot depending on the structure of both matrices.
  3. In many cases such computations are memory-bound, not CPU-bound.
  4. You may get worse performance on the GPU than on the CPU because of the memory issues described above plus memory access patterns.
  5. To my knowledge there are a couple of libraries that implement multithreaded SSMULT - Intel MKL and librsb - but I haven't heard of an R interface to either.
  6. If the matrix is huge you can partition it manually and use the standard mclapply. I doubt this will help, though.
  7. You can try Eigen via RcppEigen and perform SSMULT there. I believe it could be quite a bit faster (but still single-threaded).
  8. Ultimately I would think about how to reformulate the problem and avoid SSMULT.
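The manual partitioning of point 6 can be sketched as follows (an illustrative sketch only, not the answerer's code; par_tcrossprod and the row-block splitting scheme are my own naming and assumptions):

```r
library(Matrix)
library(parallel)

# Hypothetical helper: compute M %*% t(M) in row blocks, one block per worker.
# mclapply forks, so this only actually parallelizes on Unix-alikes.
par_tcrossprod <- function(M, n_cores = 2L) {
  idx <- splitIndices(nrow(M), n_cores)       # rows split into ~equal chunks
  blocks <- mclapply(idx,
                     function(i) M[i, , drop = FALSE] %*% t(M),
                     mc.cores = n_cores)
  do.call(rbind, blocks)                      # stack the row blocks back together
}

M1  <- rsparsematrix(2000, 2000, 0.01)
res <- par_tcrossprod(M1)
max(abs(res - tcrossprod(M1)))                # should be 0 (up to floating point)
```

Note that each worker recomputes t(M1) and touches the whole right-hand matrix, which illustrates why the answer expects little benefit: the work is memory-bound and the blocks share data.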

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 