
Split a sparse matrix into linearly independent submatrices for regression

Problem: Reducing a data set used in regression to several smaller sets, where the variables are dependent within each set but independent between sets. I have a large data set with 1000 dummy variables, but only a few are 'positive' in each row, and memory limits my ability to build different models. So I'm trying to split the data set into subsets such that there is linear dependency between the variables within a subset, but no dependency on the other subsets.

Small example:

M1 <- c(1L,0L,0L,0L,1L,1L,0L,0L,0L,0L,1L,1L,0L,0L,1L,0L)
dim(M1) <- c(4,4)

Here M1 can be split into the two 'independent' matrices:

M2 <- matrix(c(1,0,1,1), nrow = 2)  # rows 1-2, columns 1-2 of M1
M3 <- matrix(c(1,1,1,0), nrow = 2)  # rows 3-4, columns 3-4 of M1
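
One way to check this split is to index M1 directly: the two diagonal blocks should reproduce M2 and M3, and the off-diagonal blocks should contain only zeros. A minimal sketch, with the row and column ranges read off the example by hand (block1, block2 and offDiagonalZero are illustrative names, not part of the question):

```r
M1 <- matrix(c(1L,0L,0L,0L,1L,1L,0L,0L,0L,0L,1L,1L,0L,0L,1L,0L), nrow = 4)

# Diagonal blocks: the two candidate submatrices
block1 <- M1[1:2, 1:2]
block2 <- M1[3:4, 3:4]

# Off-diagonal blocks: must be all zero for the split to be valid
offDiagonalZero <- all(M1[1:2, 3:4] == 0) && all(M1[3:4, 1:2] == 0)

print(block1)
print(block2)
print(offDiagonalZero)
```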

But changing M1 to

M1[3,2] <- 1

would make all rows dependent, so no split is possible.
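
The effect of that single entry can be made visible by checking which pairs of rows share at least one column. A small sketch, assuming that "sharing a column" is the right notion of dependence here; shared and sharedAfter are illustrative names:

```r
M1 <- matrix(c(1L,0L,0L,0L,1L,1L,0L,0L,0L,0L,1L,1L,0L,0L,1L,0L), nrow = 4)

# shared[i, j] is TRUE when rows i and j have a 1 in a common column
shared <- tcrossprod(M1) > 0
print(shared[1:2, 3:4])    # rows 1-2 versus rows 3-4 before the change

M1[3, 2] <- 1L
sharedAfter <- tcrossprod(M1) > 0
print(sharedAfter[1:2, 3]) # rows 1 and 2 now both overlap with row 3
```

Row 4 still overlaps with row 3 through column 3, so after the change every row is linked to every other row, directly or through row 3.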

Ideally, what I would like is a vector whose length is the number of rows, specifying which subset each row belongs to, so that a regression can be applied to each subset. For the original example the result would be the vector:

R <- c(1,1,2,2)
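
Given such a vector, the subsets could be pulled out with split(), and within each subset the columns that are entirely zero could be dropped, which is where the memory saving would come from. A sketch only: rowsBySubset, subsets and the response vector y in the final comment are hypothetical names that do not appear in the question.

```r
M1 <- matrix(c(1L,0L,0L,0L,1L,1L,0L,0L,0L,0L,1L,1L,0L,0L,1L,0L), nrow = 4)
R <- c(1, 1, 2, 2)

# Row indices belonging to each subset
rowsBySubset <- split(seq_len(nrow(M1)), R)

# For each subset keep only its rows and the columns actually used in them
subsets <- lapply(rowsBySubset, function(rows) {
  block <- M1[rows, , drop = FALSE]
  block[, colSums(block != 0) > 0, drop = FALSE]
})

print(subsets)

# A regression could then be fitted per subset, for example
# (y is a hypothetical response vector with one entry per row of M1):
# fits <- lapply(names(subsets), function(k) lm.fit(subsets[[k]], y[rowsBySubset[[k]]]))
```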

The problem is related to the rank of the matrix, but all the answers I have been able to find deal with reducing the dimension of the matrix, not with subsetting it into independent parts.
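
One possible way to frame the subsetting, offered here as an assumption rather than something established in the question: treat the columns as nodes and let every row link the columns in which it has a non-zero entry. The independent subsets are then the connected components. A minimal base R sketch using a union-find over columns, which avoids building a rows-by-rows matrix (rowGroups and findRoot are illustrative names):

```r
rowGroups <- function(M) {
  parent <- seq_len(ncol(M))           # each column starts as its own set

  # Follow parent links until the representative column of the set is reached
  findRoot <- function(j) {
    while (parent[j] != j) j <- parent[j]
    j
  }

  firstCol <- rep(NA_integer_, nrow(M))
  for (r in seq_len(nrow(M))) {
    cols <- which(M[r, ] != 0)
    if (length(cols) == 0) next        # all-zero row: left as NA
    firstCol[r] <- cols[1]
    root <- findRoot(cols[1])
    # Merge every column used by this row into one set
    for (j in cols[-1]) {
      rj <- findRoot(j)
      if (rj != root) parent[rj] <- root
    }
  }

  # A row's group is the set that its first non-zero column ended up in
  roots <- vapply(firstCol, function(j) {
    if (is.na(j)) NA_integer_ else findRoot(j)
  }, integer(1))
  match(roots, unique(roots[!is.na(roots)]))  # relabel as 1, 2, ...
}

M1 <- matrix(c(1L,0L,0L,0L,1L,1L,0L,0L,0L,0L,1L,1L,0L,0L,1L,0L), nrow = 4)
print(rowGroups(M1))     # 1 1 2 2

M1b <- M1
M1b[3, 2] <- 1L
print(rowGroups(M1b))    # 1 1 1 1
```

The sketch scans the matrix one row at a time and stores only one integer per column and per row, so it should stay small even with 1000 dummy variables.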

Iterating through the matrix is one solution, implemented by the following functions (2D only). It is not pretty, and it does not use any matrix information, but it is posted as a way to solve the problem:

# Rows that have a 1 in any of the given columns
getRow <- function(data, col) {
  which(rowSums(data[, col, drop = FALSE] == 1) > 0)
}

# Columns that have a 1 in any of the given rows
getCol <- function(data, row) {
  which(colSums(data[row, , drop = FALSE] == 1) > 0)
}

splitmatrix <- function(data) {
  if (!is.matrix(data)) {
    stop("data must be a matrix")
  }
  if (ncol(data) < 1) {
    stop("no columns in data")
  }
  group <- rep(NA_integer_, nrow(data))  # subset id for each row
  colDone <- rep(FALSE, ncol(data))      # columns already assigned to a subset
  i <- 0L

  # Start a new subset from each column that has not been reached yet
  for (start in seq_len(ncol(data))) {
    if (colDone[start]) next
    colDone[start] <- TRUE
    row <- getRow(data, start)
    if (length(row) == 0) next           # all-zero column: belongs to no subset
    i <- i + 1L

    # Alternate between rows and columns until nothing new is reached
    while (length(row) > 0) {
      group[row] <- i
      col <- getCol(data, row)
      col <- col[!colDone[col]]
      colDone[col] <- TRUE
      if (length(col) == 0) break
      row <- getRow(data, col)
      row <- row[is.na(group[row])]
    }
  }
  group  # one entry per row, in row order; all-zero rows stay NA
}

# hjlpmidMatrix is the asker's own data (not shown here)
# result <- splitmatrix(hjlpmidMatrix)

