简体   繁体   English

如何将稀疏矩阵一分为二

[英]How to split a sparse matrix into two

I have a large sparse matrix which I need to correlate which haven't been possible for me because:我有一个大的稀疏矩阵,我需要将其关联起来,这对我来说是不可能的,因为:

  1. I cant convert the sparse matrix to a dense matrix due to R's memory limitation由于 R 的 memory 限制,我无法将稀疏矩阵转换为密集矩阵
  2. I tried using packages bigstats and bigmemory , my R froze over (using a windows 10, 8GB laptop)我尝试使用包bigstatsbigmemory ,我的 R 冻结了(使用 windows 10、8GB 笔记本电脑)
  3. There's no correlation function in R's Matrix package R的Matrix package中没有相关性function

NB: I am trying to correlate a sparse matrix in the format 'M'X + X'M - M'M' which isn't possible which is why I am trying to split the sparse matrix into two or three, turn to dense matrix using as.matrix() then correlate using the cor() then combine the correlated results into one using cbind()注意:我正在尝试以'M'X + X'M - M'M'格式关联一个稀疏矩阵,这是不可能的,这就是为什么我试图将稀疏矩阵分成两个或三个,转向密集使用as.matrix()的矩阵,然后使用cor()进行关联,然后使用cbind() () 将相关结果合并为一个

Now:现在:

I want to ask if it's possible to split a sparse matrix into two or three parts, convert to dense matrices then correlate each dense matrix then cbind the two or three dense matrices into one then export to a text file.我想问是否可以将稀疏矩阵拆分为两个或三个部分,转换为密集矩阵,然后关联每个密集矩阵,然后将两个或三个密集矩阵合并为一个,然后导出到文本文件。

What function can I use to split a sparse matrix into two or three bearing in mind that both the i and p parts of the sparse matrix are equal sizes with the same dim我可以使用什么 function 将稀疏矩阵拆分为两个或三个,记住稀疏矩阵的ip部分大小相等且dim相同

Formal class 'dgCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:73075722]  ...
  ..@ p       : int [1:73075722] 0 0 1 1 1 1 1 2 2 2 ...
  ..@ Dim     : int [1:2] 500232 500232
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:73075722]  ...
  ..@ uplo    : chr "L"
  ..@ factors : list()

The correlation output will be in this format:相关性 output 将采用以下格式:

         [,1]       [,2]       [,3]        [,4]
[1,]  1.00000000 -0.8343860  0.3612926  0.09678096
[2,] -0.83438600  1.0000000 -0.8154071  0.24611830
[3,]  0.36129256 -0.8154071  1.0000000 -0.51801346
[4,]  0.09678096  0.2461183 -0.5180135  1.00000000
[5,]  0.67411584 -0.3560782 -0.1056124  0.60987601
[6,]  0.23071712 -0.4457467  0.5117711  0.21848068
[7,]  0.49200080 -0.4246502  0.2016633  0.46971736
           [,5]       [,6]       [,7]
[1,]  0.6741158  0.2307171  0.4920008
[2,] -0.3560782 -0.4457467 -0.4246502
[3,] -0.1056124  0.5117711  0.2016633
[4,]  0.6098760  0.2184807  0.4697174
[5,]  1.0000000  0.2007979  0.7198228
[6,]  0.2007979  1.0000000  0.6965899
[7,]  0.7198228  0.6965899  1.0000000

The cor() function is a way to efficiently calculate the pearson correlation between all of the columns. cor() function 是一种有效计算所有列之间的pearson相关性的方法。 cor is very efficient but if we do not mind a hit in efficiency, we can calculate the pearson correlation manually: cor非常有效,但如果我们不介意效率下降,我们可以手动计算pearson相关性:

n_row <- nrow(res)
cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)

for (i in seq_len(nrow(res)-1)){
  x_delta = res[, i] - cMeans[i]
  sum_x_delta_sq = sum(x_delta^2)

  for (j in (i+1):nrow(res)){
    y_delta = res[, j] - cMeans[j]

    tmp <- sum(x_delta * y_delta) / sqrt((sum_x_delta_sq * sum(y_delta^2)))
    if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
  }
}

As sparsity of a matrix increases, we can make the above more complicated but more performant by separating operations involving non-sparse elements and operations involving sparse elements:随着矩阵稀疏度的增加,我们可以通过分离涉及非稀疏元素的操作和涉及稀疏元素的操作来使上述更复杂但性能更高:

n_row <- nrow(res)

cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)

for (i in 1){

  sp_i <- res[, i, drop = F]
  i_s <- sp_i@i + 1

  i_zero_delta = 0 - cMeans[i]
  i_non_zero_delta = sp_i@x - cMeans[i]

  sum_x_delta_sq = sum(c((n_row - length(i_s)) * i_zero_delta^2, i_non_zero_delta^2))

  for (j in (i+1):nrow(res)){

    sp_j <- res[, j, drop = F]
    j_s <- sp_j@i + 1

    j_zero_delta = 0L - cMeans[j]
    j_non_zero_delta = sp_j@x - cMeans[j]

    sum_y_delta_sq = sum(c((n_row - length(j_s)) * j_zero_delta^2, j_non_zero_delta^2))

    common <- intersect(i_s, j_s)
    only_i <- setdiff(i_s, j_s)
    only_j <- setdiff(j_s, i_s)
    none <- n_row - length(c(common, only_i, only_j))

    numerator = 0

    if (length(common) > 0) numerator = numerator + sum(i_non_zero_delta[match(common, i_s)] * j_non_zero_delta[match(common, j_s)])
    if (length(only_i) > 0) numerator = numerator + j_zero_delta * sum(i_non_zero_delta[match(only_i, i_s)])
    if (length(only_j) > 0) numerator = numerator + i_zero_delta * sum(j_non_zero_delta[match(only_j, j_s)])
    if (length(none) > 0)   numerator = numerator + i_zero_delta * j_zero_delta * none

    tmp <- numerator / sqrt(sum_x_delta_sq * sum_y_delta_sq)
    if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
  }
}

I implement dicing of a sparse_matrix by row indexing:我通过行索引实现 sparse_matrix 的切片:

as.DF_spM <- function(data.use,chun_size="20000000",dg_class="dgCMatrix") {
    lapply(  split(seq(nrow(data.use)), (seq(nrow(data.use))-1) %/% as.numeric(chun_size) )  , function(nx) {
        switch(dg_class,
               dgTMatrix = {as(as.matrix(data.use[nx,]),"dgTMatrix")},
               dgCMatrix = {as(as.matrix(data.use[nx,]),"dgCMatrix")},
               dgRMatrix = {as(as.matrix(data.use[nx,]),"dgRMatrix")}
               )
        }) %>% Matrix.utils::rBind.fill()
}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM