R: sparse matrix conversion
I have a matrix of factors in R and want to convert it to a matrix of 0-1 dummy variables covering every possible level of each factor.
However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it as a sparse matrix, otherwise it does not fit in my 12GB of RAM.
Currently, I am using the following code, which works fine and takes only seconds:
library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)
However, I want to use the e1071 package in R and eventually save this matrix in libsvm format with write.matrix.csr(), so I first need to convert my sparse matrix to the SparseM format.
I tried:
library(SparseM)
X2 <- as.matrix.csr(X)
but it very quickly fills my RAM, and eventually R crashes. I suspect that, internally, as.matrix.csr first converts the sparse matrix to a dense matrix, which does not fit in my computer's memory.
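A rough back-of-the-envelope check of why a dense copy would be a problem, assuming each entry is stored as an 8-byte double (the variable names here are only for illustration):

```r
# Dimensions quoted above
n_rows <- 91690
n_cols <- 16593

# Size of a dense double-precision copy, in GiB (8 bytes per entry)
dense_gib <- n_rows * n_cols * 8 / 2^30
dense_gib  # roughly 11.3
```

A single dense copy would already take almost all of the 12GB, before counting any temporary copies made during the conversion.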
My other alternative would be to create the sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors), but it does not accept a data frame of factors.
Is there an equivalent of sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.
Quite tricky, but I think I got it.
Let's start with a sparse matrix from the Matrix package:
library(Matrix)
library(SparseM)
# Small 8 x 10 example with 7 non-zero entries
i <- c(1, 3:8)
j <- c(2, 9, 6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)
The Matrix package uses a column-oriented compression format, while SparseM supports both column- and row-oriented formats and has functions that easily handle the conversion from one to the other.
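To make the column-oriented layout concrete, here is a small sketch that inspects the slots of the example matrix. It assumes sparseMatrix() returns a dgCMatrix for numeric input, which is what the slot names x, i and p used in the conversion rely on:

```r
library(Matrix)

# Same small example as above: 8 x 10 with 7 non-zero entries
X <- sparseMatrix(i = c(1, 3:8), j = c(2, 9, 6:10), x = 7 * (1:7))

class(X)  # expected to be "dgCMatrix" (numeric, general, column-compressed)
X@x       # non-zero values, stored column by column
X@i       # 0-based row index of each stored value
X@p       # 0-based column pointers, length ncol(X) + 1
```

Note that the values come out in column order, not in the order they were passed to sparseMatrix().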
So we will first convert our column-oriented Matrix object into a column-oriented SparseM matrix. We just need to be careful to call the right constructor and to note that the two packages use different index conventions (the Matrix slots start at 0, the SparseM slots start at 1):
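X.csc <- new("matrix.csc",
             ra = X@x,           # non-zero values
             ja = X@i + 1L,      # row indices, shifted from 0-based to 1-based
             ia = X@p + 1L,      # column pointers, shifted from 0-based to 1-based
             dimension = X@Dim)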
Then change from the column-oriented to the row-oriented format:
X.csr <- as.matrix.csr(X.csc)
And you're done! You can check that the two matrices are identical (on my small example) by doing:
range(as.matrix(X) - as.matrix(X.csc))
# [1] 0 0
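For the factor data in the question, the same steps can be wrapped in a small helper. This is only a sketch: dgc_to_csr is a hypothetical name (it is not part of Matrix or SparseM), the toy data frame stands in for the real X_factors, and it assumes sparse.model.matrix() returns a dgCMatrix:

```r
library(Matrix)
library(SparseM)

# Hypothetical helper (not part of either package): dgCMatrix -> matrix.csr
dgc_to_csr <- function(X) {
  stopifnot(inherits(X, "dgCMatrix"))
  X.csc <- new("matrix.csc",
               ra = X@x,        # non-zero values
               ja = X@i + 1L,   # row indices, shifted to 1-based
               ia = X@p + 1L,   # column pointers, shifted to 1-based
               dimension = X@Dim)
  as.matrix.csr(X.csc)          # column-oriented -> row-oriented
}

# Toy stand-in for the factor data in the question
X_factors <- data.frame(a = factor(c("u", "v", "u", "w")),
                        b = factor(c("p", "p", "q", "q")))
X <- sparse.model.matrix(~ . - 1, data = X_factors)
X.csr <- dgc_to_csr(X)

# Largest absolute difference between the two; dimnames are not carried over
max_diff <- max(abs(unname(as.matrix(X)) - as.matrix(X.csr)))
```

The result can then be passed to write.matrix.csr() from e1071, for example write.matrix.csr(X.csr, file = "out.dat"). The dense comparison on the last line is only sensible for small test matrices, not for the full 91690x16593 one.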