I'm attempting to apply the sweep function to a sparse matrix (dgCMatrix). Unfortunately, when I do that I get a memory error. It seems that sweep is expanding my sparse matrix into a full dense matrix.
Is there an easy way to perform this operation without it blowing up my memory?
This is what I'm trying to do:
sparse_matrix <- sweep(sparse_matrix, 1, vector_to_multiply, '*')
I second @user20650's recommendation to use direct multiplication of the form mat * vec, which multiplies every column of your matrix mat with your vector vec by implicitly recycling vec.
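To make the equivalence concrete, here is a minimal sketch on a made-up 3 x 3 matrix (the values and the names dense, sparse and w are illustrative assumptions, not taken from the question). It checks that mat * vec gives the same numbers as sweep(mat, 1, vec, '*') and that the result is still a sparse matrix. The Diagonal() line shows another way to express the same row scaling as a matrix product, which is not part of the recommendation above.

```r
library(Matrix)

# Small illustrative data (made-up values)
dense <- matrix(c(1, 0, 2,
                  0, 3, 0,
                  5, 0, 4), nrow = 3, byrow = TRUE)
sparse <- Matrix(dense, sparse = TRUE)
w <- c(10, 20, 30)  # one weight per row

by_sweep <- sweep(dense, 1, w, "*")      # dense reference result
by_mult  <- sparse * w                   # w is recycled down each column
by_diag  <- Diagonal(x = w) %*% sparse   # row scaling as a matrix product

is(by_mult, "sparseMatrix")                              # result stays sparse
all.equal(unname(as.matrix(by_mult)), unname(by_sweep))  # same values as sweep
all.equal(unname(as.matrix(by_diag)), unname(by_sweep))
```

Note that the recycling only lines up with the rows when length(vec) equals nrow(mat).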
I understand that your main requirement here is memory, but it's still interesting to run a microbenchmark comparison of sweep and direct multiplication for both a dense and a sparse matrix:
# Sample data
library(Matrix)
set.seed(2018)
mat <- matrix(sample(c(0, 1), 10^6, replace = TRUE), nrow = 10^3)
mat_sparse <- Matrix(mat, sparse = TRUE)
vec <- 1:dim(mat)[1]
library(microbenchmark)
res <- microbenchmark(
sweep_dense = sweep(mat, 1, vec, '*'),
sweep_sparse = sweep(mat_sparse, 1, vec, '*'),
mult_dense = mat * vec,
mult_sparse = mat_sparse * vec
)
res
Unit: milliseconds
expr min lq mean median uq max
sweep_dense 8.639459 10.038711 14.857274 13.064084 18.07434 32.2172
sweep_sparse 116.649865 128.111162 162.736864 135.932811 155.63415 369.3997
mult_dense 2.030882 3.193082 7.744076 4.033918 7.10471 184.9396
mult_sparse 12.998628 15.020373 20.760181 16.894000 22.95510 201.5509
library(ggplot2)
autoplot(res)
The operations involving a sparse matrix are actually slower than the ones with a dense matrix here (by median, roughly 10x for sweep and 4x for direct multiplication). Note, however, that direct multiplication is faster than sweep in both cases.
We can use profmem to profile the memory usage of the different approaches.
library(profmem)
mem <- list(
sweep_dense = profmem(sweep(mat, 1, vec, '*')),
sweep_sparse = profmem(sweep(mat_sparse, 1, vec, '*')),
mult_dense = profmem(mat * vec),
mult_sparse = profmem(mat_sparse * vec))
lapply(mem, function(x) utils:::format.object_size(sum(x$bytes), units = "Mb"))
#$sweep_dense
#[1] "15.3 Mb"
#
#$sweep_sparse
#[1] "103.1 Mb"
#
#$mult_dense
#[1] "7.6 Mb"
#
#$mult_sparse
#[1] "13.4 Mb"
To be honest, I'm surprised that the memory footprint of the direct multiplication with a sparse matrix is not smaller than that involving a dense matrix. Perhaps the sample data are too simplistic: about half of the entries are non-zero, so the matrix is not really sparse. It might be worth exploring this with your actual data (or a representative subset thereof).
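As a rough way to explore this, the sketch below measures how sparse the sample matrix really is and compares storage sizes for a much sparser matrix. The 1% density and the names very_sparse and very_dense are assumptions chosen for illustration, so treat the outcome as indicative only and repeat the check on your own data.

```r
library(Matrix)
set.seed(2018)

# The sample data from above: entries are 0 or 1 with equal probability
mat <- matrix(sample(c(0, 1), 10^6, replace = TRUE), nrow = 10^3)
mat_sparse <- Matrix(mat, sparse = TRUE)

# Fraction of non-zero entries in the sample matrix
density_sample <- nnzero(mat_sparse) / prod(dim(mat_sparse))
density_sample

# Hypothetical comparison case: same shape, about 1% non-zero entries
very_sparse <- rsparsematrix(10^3, 10^3, density = 0.01)
very_dense  <- as.matrix(very_sparse)

# Storage needed by the sparse and the dense representation
size_sparse <- object.size(very_sparse)
size_dense  <- object.size(very_dense)
format(size_sparse, units = "Mb")
format(size_dense, units = "Mb")
```

A compressed sparse matrix stores a value and a row index for every non-zero entry, so its advantage over dense storage shrinks as the share of non-zero entries grows.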
I'm working with a big and very sparse dgTMatrix (200k rows and 10k columns) in an NLP problem. After hours of thinking about a good solution, I created an alternative sweep function for sparse matrices. It is very fast and memory efficient: it took just 1 second and less than 1 GB of memory to multiply all matrix rows by an array of weights. For margin = 1 it works for both dgCMatrix and dgTMatrix; for margin = 2 it relies on the j slot, which only dgTMatrix has.
Here it is:
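sweep_sparse <- function(x, margin, stats, fun = "*") {
  f <- match.fun(fun)
  # Row (margin 1) or column (margin 2) index of each stored value, 1-based
  if (margin == 1) {
    idx <- x@i + 1
  } else {
    idx <- x@j + 1  # the j slot exists only in triplet (dgTMatrix) form
  }
  # Update only the stored values, so the matrix is never densified
  x@x <- f(x@x, stats[idx])
  return(x)
}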
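The following is a hedged sketch of how this idea could be extended so that margin = 2 also works for a dgCMatrix. The name sweep_sparse_any and the test data are hypothetical. It assumes that the column index of every stored value can be rebuilt from the column pointer slot p, and it is only meaningful for functions that leave zeros at zero (such as "*" or "/"), because entries that are not stored are never touched.

```r
library(Matrix)

# Hypothetical variant of sweep_sparse that also handles margin = 2
# for column-compressed matrices (dgCMatrix)
sweep_sparse_any <- function(x, margin, stats, fun = "*") {
  f <- match.fun(fun)
  if (margin == 1) {
    idx <- x@i + 1L
  } else if (is(x, "CsparseMatrix")) {
    # No j slot here: diff(p) is the number of stored values per column
    idx <- rep(seq_len(ncol(x)), diff(x@p))
  } else {
    idx <- x@j + 1L
  }
  x@x <- f(x@x, stats[idx])
  x
}

# Made-up 3 x 3 example in both storage formats
m_c <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 2, 3),
                    x = c(1, 2, 3, 4), dims = c(3, 3))  # dgCMatrix
m_t <- as(m_c, "TsparseMatrix")                         # dgTMatrix
dense <- as.matrix(m_c)
w <- c(2, 3, 4)

res_c1 <- sweep_sparse_any(m_c, 1, w)  # scale rows, compressed format
res_c2 <- sweep_sparse_any(m_c, 2, w)  # scale columns, compressed format
res_t2 <- sweep_sparse_any(m_t, 2, w)  # scale columns, triplet format

all.equal(unname(as.matrix(res_c2)), unname(sweep(dense, 2, w, "*")))
```

With a function such as "+" the result would differ from sweep, since the implicit zeros would stay zero instead of becoming the added value.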