R: sweep on a sparse matrix

Question

I'm attempting to apply the sweep function to a sparse matrix (dgCMatrix). Unfortunately, when I do that I get a memory error. It seems that sweep is expanding my sparse matrix to a full dense matrix.

Is there an easy way to perform this operation without it blowing up my memory?

This is what I'm trying to do:

sparse_matrix <- sweep(sparse_matrix, 1, vector_to_multiply, '*')

Answer 1

I second @user20650's recommendation to use direct multiplication of the form mat * vec, which multiplies every column of your matrix mat element-wise with your vector vec by implicitly recycling vec. When vec has one entry per row, this scales row i by vec[i], which is what sweep(mat, 1, vec, '*') does.
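As a minimal sketch of that recommendation, the snippet below checks on a tiny made-up matrix that mat * vec matches sweep over rows while staying sparse. The names m, row_w and col_w are illustrative and not from the question. It also shows multiplication by a diagonal matrix from the Matrix package as one possible way to cover the column case (MARGIN = 2), which plain recycling does not handle directly.

```r
library(Matrix)

# Small illustrative sparse matrix (3 rows, 4 columns); the values are made up.
m <- sparseMatrix(i = c(1, 2, 3, 3), j = c(1, 2, 3, 4),
                  x = c(1, 2, 3, 4), dims = c(3, 4))
row_w <- c(10, 100, 1000)   # one weight per row
col_w <- c(1, 2, 3, 4)      # one weight per column

# Row scaling: row_w is recycled down each column, so row i is scaled by row_w[i].
by_row <- m * row_w

# The same row scaling written as a product with a sparse diagonal matrix.
by_row_diag <- Diagonal(x = row_w) %*% m

# Column scaling (the MARGIN = 2 case): diagonal matrix on the right.
by_col <- m %*% Diagonal(x = col_w)

# Compare against sweep() on a dense copy, which is fine at this toy size.
stopifnot(
  methods::is(by_row, "sparseMatrix"),
  isTRUE(all.equal(as.matrix(by_row), sweep(as.matrix(m), 1, row_w, "*"))),
  isTRUE(all.equal(as.matrix(by_row_diag), as.matrix(by_row))),
  isTRUE(all.equal(as.matrix(by_col), sweep(as.matrix(m), 2, col_w, "*")))
)
```

The dense copies are only there for verification; on a large matrix you would skip the checks and keep just the multiplication.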

Processing time profiling

I understand that your main requirement here is memory, but it's interesting to run a microbenchmark comparison of the sweep and direct multiplication methods for both a dense and a sparse matrix:

# Sample data
library(Matrix)  
set.seed(2018)
mat <- matrix(sample(c(0, 1), 10^6, replace = TRUE), nrow = 10^3)
mat_sparse <- Matrix(mat, sparse = TRUE)
vec <- 1:dim(mat)[1]

library(microbenchmark)
res <- microbenchmark(
    sweep_dense = sweep(mat, 1, vec, '*'),
    sweep_sparse = sweep(mat_sparse, 1, vec, '*'),
    mult_dense = mat * vec,
    mult_sparse = mat_sparse * vec
)
res
Unit: milliseconds
         expr        min         lq       mean     median        uq      max
  sweep_dense   8.639459  10.038711  14.857274  13.064084  18.07434  32.2172
 sweep_sparse 116.649865 128.111162 162.736864 135.932811 155.63415 369.3997
   mult_dense   2.030882   3.193082   7.744076   4.033918   7.10471 184.9396
  mult_sparse  12.998628  15.020373  20.760181  16.894000  22.95510 201.5509

library(ggplot2)
autoplot(res)

In this benchmark the operations on the sparse matrix are noticeably slower than the corresponding dense ones (a median of about 136 ms versus 13 ms for sweep, and about 17 ms versus 4 ms for direct multiplication). Note, however, that direct multiplication is faster than sweep in both cases.

Memory profiling

We can use profmem to profile the memory usage of the different approaches.

library(profmem)
mem <- list(
    sweep_dense = profmem(sweep(mat, 1, vec, '*')),
    sweep_sparse = profmem(sweep(mat_sparse, 1, vec, '*')),
    mult_dense = profmem(mat * vec),
    mult_sparse = profmem(mat_sparse * vec))
lapply(mem, function(x) utils:::format.object_size(sum(x$bytes), units = "Mb"))
#$sweep_dense
#[1] "15.3 Mb"
#
#$sweep_sparse
#[1] "103.1 Mb"
#
#$mult_dense
#[1] "7.6 Mb"
#
#$mult_sparse
#[1] "13.4 Mb"

To be honest, I'm surprised that the memory footprint of direct multiplication with a sparse matrix is not smaller than that with a dense matrix. Perhaps the sample data are too simplistic: about half of the entries in mat are nonzero, so it is not really sparse. It might be worth exploring this with your actual data (or a representative subset thereof).
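One way to explore this is to repeat the comparison on a matrix that is actually sparse. The sketch below assumes a density of 1% nonzero entries, which is an arbitrary choice and not a property of the asker's data, and it uses rsparsematrix from the Matrix package to generate the test matrix. It compares only the size of the stored objects with object.size, not the allocations made during the operation, so it complements rather than replaces the profmem run above.

```r
library(Matrix)

set.seed(2018)
n <- 1000
vec <- seq_len(n)

# Assumed density of 1% nonzero entries; replace with a value typical of your data.
sp <- rsparsematrix(n, n, density = 0.01)
dn <- as.matrix(sp)   # dense copy of the same data

# Storage needed for the same data in each representation, in bytes.
size_sparse <- as.numeric(object.size(sp))
size_dense  <- as.numeric(object.size(dn))

# Direct multiplication keeps the sparse representation.
scaled <- sp * vec
size_scaled <- as.numeric(object.size(scaled))

print(c(sparse = size_sparse, dense = size_dense, scaled = size_scaled))
```

At this density the sparse objects should be far smaller than the dense copy; the gap narrows as the share of nonzero entries grows.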

Answer 2

I'm working with a big and very sparse dgTMatrix (200k rows and 10k columns) in an NLP problem. After hours of thinking about a good solution, I created an alternative sweep function for sparse matrices. It is very fast and memory efficient. It took just 1 second and less than 1 GB of memory to multiply all matrix rows by an array of weights. For margin = 1 it works for both dgCMatrix and dgTMatrix.

Here it is:

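sweep_sparse <- function(x, margin, stats, fun = "*") {
   f <- match.fun(fun)
   # Find the row or column of each stored entry (slots hold 0-based indices).
   if (margin == 1) {
      idx <- x@i + 1   # row indices (dgCMatrix and dgTMatrix)
   } else {
      idx <- x@j + 1   # column indices (dgTMatrix only, dgCMatrix has no @j slot)
   }
   # Only stored entries are updated; implicit zeros are left untouched.
   x@x <- f(x@x, stats[idx])
   return(x)
}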
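Because the function above reads x@j for margin = 2, it does not cover column-wise sweeps on a dgCMatrix, which stores column pointers in x@p instead. A hedged sketch of one way around that is shown below: sweep_sparse_csc is a hypothetical name, not part of the answer or of the Matrix package, and it rebuilds the column index of every stored entry from the pointers. Like the original, it only touches stored entries, so it is suitable for functions that map zero to zero (such as "*"), not for "+" or "-".

```r
library(Matrix)

# Hypothetical variant of sweep_sparse for dgCMatrix input, both margins.
sweep_sparse_csc <- function(x, margin, stats, fun = "*") {
  stopifnot(methods::is(x, "dgCMatrix"), margin %in% c(1, 2))
  f <- match.fun(fun)
  if (margin == 1) {
    # Row index of each stored entry (slot holds 0-based indices).
    idx <- x@i + 1L
  } else {
    # diff(x@p) is the number of stored entries per column, so repeating
    # each column number that many times gives the column index per entry.
    idx <- rep(seq_len(ncol(x)), times = diff(x@p))
  }
  x@x <- f(x@x, stats[idx])
  x
}

set.seed(1)
m <- rsparsematrix(5, 4, density = 0.4)   # small random dgCMatrix
row_w <- c(1, 2, 3, 4, 5)
col_w <- c(10, 20, 30, 40)

by_row <- sweep_sparse_csc(m, 1, row_w)
by_col <- sweep_sparse_csc(m, 2, col_w)

# Check against base sweep() on a dense copy (fine at this toy size).
stopifnot(
  isTRUE(all.equal(as.matrix(by_row), sweep(as.matrix(m), 1, row_w, "*"))),
  isTRUE(all.equal(as.matrix(by_col), sweep(as.matrix(m), 2, col_w, "*")))
)
```

For a dgTMatrix the original function already handles both margins, so this variant is only relevant when the data is stored in compressed column form.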

Answer 1
1 2019-03-28 23:58:08

Answer 2
0 2019-10-04 22:08:33
