简体   繁体   中英

How to split a sparse matrix into two

I have a large sparse matrix which I need to correlate which haven't been possible for me because:

  1. I cant convert the sparse matrix to a dense matrix due to R's memory limitation
  2. I tried using packages bigstats and bigmemory , my R froze over (using a windows 10, 8GB laptop)
  3. There's no correlation function in R's Matrix package

NB: I am trying to correlate a sparse matrix in the format 'M'X + X'M - M'M' which isn't possible which is why I am trying to split the sparse matrix into two or three, turn to dense matrix using as.matrix() then correlate using the cor() then combine the correlated results into one using cbind()

Now:

I want to ask if it's possible to split a sparse matrix into two or three parts, convert to dense matrices then correlate each dense matrix then cbind the two or three dense matrices into one then export to a text file.

What function can I use to split a sparse matrix into two or three bearing in mind that both the i and p parts of the sparse matrix are equal sizes with the same dim

Formal class 'dgCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:73075722]  ...
  ..@ p       : int [1:73075722] 0 0 1 1 1 1 1 2 2 2 ...
  ..@ Dim     : int [1:2] 500232 500232
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:73075722]  ...
  ..@ uplo    : chr "L"
  ..@ factors : list()

The correlation output will be in this format:

         [,1]       [,2]       [,3]        [,4]
[1,]  1.00000000 -0.8343860  0.3612926  0.09678096
[2,] -0.83438600  1.0000000 -0.8154071  0.24611830
[3,]  0.36129256 -0.8154071  1.0000000 -0.51801346
[4,]  0.09678096  0.2461183 -0.5180135  1.00000000
[5,]  0.67411584 -0.3560782 -0.1056124  0.60987601
[6,]  0.23071712 -0.4457467  0.5117711  0.21848068
[7,]  0.49200080 -0.4246502  0.2016633  0.46971736
           [,5]       [,6]       [,7]
[1,]  0.6741158  0.2307171  0.4920008
[2,] -0.3560782 -0.4457467 -0.4246502
[3,] -0.1056124  0.5117711  0.2016633
[4,]  0.6098760  0.2184807  0.4697174
[5,]  1.0000000  0.2007979  0.7198228
[6,]  0.2007979  1.0000000  0.6965899
[7,]  0.7198228  0.6965899  1.0000000

The cor() function is a way to efficiently calculate the pearson correlation between all of the columns. cor is very efficient but if we do not mind a hit in efficiency, we can calculate the pearson correlation manually:

n_row <- nrow(res)
cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)

for (i in seq_len(nrow(res)-1)){
  x_delta = res[, i] - cMeans[i]
  sum_x_delta_sq = sum(x_delta^2)

  for (j in (i+1):nrow(res)){
    y_delta = res[, j] - cMeans[j]

    tmp <- sum(x_delta * y_delta) / sqrt((sum_x_delta_sq * sum(y_delta^2)))
    if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
  }
}

As sparsity of a matrix increases, we can make the above more complicated but more performant by separating operations involving non-sparse elements and operations involving sparse elements:

n_row <- nrow(res)

cor_mat <- Matrix(0L,n_row, ncol(res))
cMeans <- Matrix::colMeans(res)

for (i in 1){

  sp_i <- res[, i, drop = F]
  i_s <- sp_i@i + 1

  i_zero_delta = 0 - cMeans[i]
  i_non_zero_delta = sp_i@x - cMeans[i]

  sum_x_delta_sq = sum(c((n_row - length(i_s)) * i_zero_delta^2, i_non_zero_delta^2))

  for (j in (i+1):nrow(res)){

    sp_j <- res[, j, drop = F]
    j_s <- sp_j@i + 1

    j_zero_delta = 0L - cMeans[j]
    j_non_zero_delta = sp_j@x - cMeans[j]

    sum_y_delta_sq = sum(c((n_row - length(j_s)) * j_zero_delta^2, j_non_zero_delta^2))

    common <- intersect(i_s, j_s)
    only_i <- setdiff(i_s, j_s)
    only_j <- setdiff(j_s, i_s)
    none <- n_row - length(c(common, only_i, only_j))

    numerator = 0

    if (length(common) > 0) numerator = numerator + sum(i_non_zero_delta[match(common, i_s)] * j_non_zero_delta[match(common, j_s)])
    if (length(only_i) > 0) numerator = numerator + j_zero_delta * sum(i_non_zero_delta[match(only_i, i_s)])
    if (length(only_j) > 0) numerator = numerator + i_zero_delta * sum(j_non_zero_delta[match(only_j, j_s)])
    if (length(none) > 0)   numerator = numerator + i_zero_delta * j_zero_delta * none

    tmp <- numerator / sqrt(sum_x_delta_sq * sum_y_delta_sq)
    if (abs(tmp) > 0.05) cor_mat[i, j] <- tmp
  }
}

I implement dicing of a sparse_matrix by row indexing:

as.DF_spM <- function(data.use,chun_size="20000000",dg_class="dgCMatrix") {
    lapply(  split(seq(nrow(data.use)), (seq(nrow(data.use))-1) %/% as.numeric(chun_size) )  , function(nx) {
        switch(dg_class,
               dgTMatrix = {as(as.matrix(data.use[nx,]),"dgTMatrix")},
               dgCMatrix = {as(as.matrix(data.use[nx,]),"dgCMatrix")},
               dgRMatrix = {as(as.matrix(data.use[nx,]),"dgRMatrix")}
               )
        }) %>% Matrix.utils::rBind.fill()
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM