Subsetting non-NA

I have a matrix in which every row has at least one NA cell, and every column has at least one NA cell as well. What I need is to find the largest subset of this matrix that contains no NAs.

For example, for this matrix A:

A <- 
  structure(c(NA, NA, NA, NA, 2L, NA,
              1L, 1L, 1L, 0L, NA, NA,
              1L, 8L, NA, 1L, 1L, NA, 
              NA, 1L, 1L, 6L, 1L, 3L, 
              NA, 1L, 5L, 1L, 1L, NA),
            .Dim = c(6L, 5L),
            .Dimnames = 
              list(paste0("R", 1:6),
                   paste0("C", 1:5)))

A
    C1  C2  C3  C4  C5
R1  NA  1   1   NA  NA
R2  NA  1   8   1   1
R3  NA  1   NA  1   5
R4  NA  0   1   6   1
R5  2   NA  1   1   1
R6  NA  NA  NA  3   NA

There are two solutions (8 cells): A[c(2, 4), 2:5] and A[2:5, 4:5], though finding just one valid solution is enough for my purposes. The dimensions of my actual matrix are 77x132.

Being a noob, I see no obvious way to do this. Could anyone help me with some ideas?

1) optim In this approach we relax the problem to a continuous optimization problem, which we solve with optim.

The objective function is f, and its input is a 0-1 vector whose first nrow(A) entries correspond to rows and whose remaining entries correspond to columns. f uses a matrix Ainf, which is derived from A by replacing the NAs with a large negative number and the non-NAs with 1. In terms of Ainf, the negative of the number of elements in the rectangle of rows and columns corresponding to x is -x[seq(6)] %*% Ainf %*% x[-seq(6)], which we minimize as a function of x subject to each component of x lying between 0 and 1.
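
To make the objective concrete, here is a small sketch that evaluates f on a few hand-built 0-1 vectors for the matrix from the question. The helper `indicator` is a hypothetical name introduced only for this illustration; Ainf and f are written as described above.

```r
# Example matrix from the question.
A <- structure(c(NA, NA, NA, NA, 2L, NA,
                 1L, 1L, 1L, 0L, NA, NA,
                 1L, 8L, NA, 1L, 1L, NA,
                 NA, 1L, 1L, 6L, 1L, 3L,
                 NA, 1L, 5L, 1L, 1L, NA),
               .Dim = c(6L, 5L),
               .Dimnames = list(paste0("R", 1:6), paste0("C", 1:5)))

nr <- nrow(A)
# NA cells get a penalty of -30 (minus the number of cells); other cells count as 1.
Ainf <- ifelse(is.na(A), -prod(dim(A)), 1)
f <- function(x) -c(x[seq(nr)] %*% Ainf %*% x[-seq(nr)])

# Hypothetical helper (not in the original): 0-1 vector for a set of rows and columns.
indicator <- function(rows, cols) {
  c(seq_len(nrow(A)) %in% rows, seq_len(ncol(A)) %in% cols) + 0
}

f(indicator(c(2, 4), 2:5))     # NA-free 2 x 4 rectangle: -8
f(indicator(2:4, c(2, 4, 5)))  # NA-free 3 x 3 rectangle: -9
f(indicator(2:4, 2:5))         # 12 cells, one of them NA: -(11 - 30) = 19
```

A single NA inside the rectangle outweighs every non-NA cell the rectangle could contain, which is why the minimizer avoids NAs.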

Although this relaxes the original problem to a continuous optimization, it seems that we get an integer solution, as desired, anyway.

Actually, most of the code below is just to get the starting value. To do that we first apply seriation. This permutes the rows and columns to give a more blocky structure, and then in the permuted matrix we find the largest square submatrix.

In the case of the specific A in the question, the largest rectangular submatrix happens to be square, and the starting values are already good enough to produce the optimum, but we perform the optimization anyway so that it works in general. You can play around with different starting values if you like. For example, change k from 1 to some higher number in largestSquare, in which case largestSquare will return k columns giving k starting values, which can be used in k runs of optim, taking the best.

If the starting values are sufficiently good then this should produce the optimum.

library(seriation) # only used for starting values

A.na <- is.na(A) + 0

Ainf <- ifelse(A.na, -prod(dim(A)), 1) # used by f
nr <- nrow(A) # used by f
f <- function(x) - c(x[seq(nr)] %*% Ainf %*% x[-seq(nr)])

# starting values

# Input is a square matrix of zeros and ones.
# Output is a matrix with k columns such that first column defines the
# largest square submatrix of ones, second defines next largest and so on.
# Based on algorithm given here:
# http://www.geeksforgeeks.org/maximum-size-sub-matrix-with-all-1s-in-a-binary-matrix/
largestSquare <- function(M, k = 1) {
  nr <- nrow(M); nc <- ncol(M)
  S <- 0*M; S[1, ] <- M[1, ]; S[, 1] <- M[, 1]
  for(i in 2:nr) 
    for(j in 2:nc)
      if (M[i, j] == 1) S[i, j] = min(S[i, j-1], S[i-1, j], S[i-1, j-1]) + 1
  o <- head(order(-S), k)
  d <- data.frame(row = row(M)[o], col = col(M)[o], mx = S[o])
  apply(d, 1, function(x) { 
    dn <- dimnames(M[x[1] - 1:x[3] + 1, x[2] - 1:x[3] + 1, drop = FALSE])
    out <- c(rownames(M) %in% dn[[1]], colnames(M) %in% dn[[2]]) + 0
    setNames(out, unlist(dimnames(M)))
  })
}
s <- seriate(A.na)
p <- permute(A.na, s)
# calculate largest square submatrix of zeros in p, rearranged to be in A's order
st <- largestSquare(1-p)[unlist(dimnames(A)), 1]

res <- optim(st, f, lower = 0*st, upper = st^0, method = "L-BFGS-B")

giving:

> res
$par
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5 
 0  1  1  1  0  0  0  1  0  1  1 

$value
[1] -9

$counts
function gradient 
       1        1 

$convergence
[1] 0

$message
[1] "CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL"
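
As a quick check, the solution vector can be turned back into a submatrix of A. This is a minimal sketch that hard-codes the `$par` values printed above rather than rerunning optim; the rounding step is an added precaution in case the optimizer returns values that are only approximately 0 or 1.

```r
# Example matrix from the question.
A <- structure(c(NA, NA, NA, NA, 2L, NA,
                 1L, 1L, 1L, 0L, NA, NA,
                 1L, 8L, NA, 1L, 1L, NA,
                 NA, 1L, 1L, 6L, 1L, 3L,
                 NA, 1L, 5L, 1L, 1L, NA),
               .Dim = c(6L, 5L),
               .Dimnames = list(paste0("R", 1:6), paste0("C", 1:5)))

# Solution vector as printed in res$par (rows first, then columns).
par <- c(R1 = 0, R2 = 1, R3 = 1, R4 = 1, R5 = 0, R6 = 0,
         C1 = 0, C2 = 1, C3 = 0, C4 = 1, C5 = 1)

keep <- round(par) == 1
rows <- keep[seq_len(nrow(A))]    # first nrow(A) entries select rows
cols <- keep[-seq_len(nrow(A))]   # remaining entries select columns

sub <- A[rows, cols, drop = FALSE]
sub
anyNA(sub)    # FALSE
length(sub)   # 9, matching -res$value
```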

2) GenSA Another possibility is to repeat (1) but use GenSA from the GenSA package instead of optim. It does not require starting values (although you can provide one using the par argument, and this might improve the solution in some cases), so the code is considerably shorter, but since it uses simulated annealing it can be expected to take substantially longer to run. It uses f (and nr and Ainf, which f uses) from (1). Below we try it without a starting value.

library(GenSA)
resSA <- GenSA(lower = rep(0, sum(dim(A))), upper = rep(1, sum(dim(A))), fn = f)

giving:

> setNames(resSA$par, unlist(dimnames(A)))
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5 
 0  1  1  1  0  0  0  1  0  1  1 

> resSA$value
[1] -9
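
The idea in (1) of running optim from several starting values and keeping the best can also be sketched with plain random starts. Note that this swaps in random starting points for the seriation-based ones from largestSquare, so it is only an illustration under that assumption: there is no guarantee that it reaches the optimum or an integer solution, and the seed and the number of starts are arbitrary choices.

```r
# Example matrix from the question.
A <- structure(c(NA, NA, NA, NA, 2L, NA,
                 1L, 1L, 1L, 0L, NA, NA,
                 1L, 8L, NA, 1L, 1L, NA,
                 NA, 1L, 1L, 6L, 1L, 3L,
                 NA, 1L, 5L, 1L, 1L, NA),
               .Dim = c(6L, 5L),
               .Dimnames = list(paste0("R", 1:6), paste0("C", 1:5)))

nr <- nrow(A)
Ainf <- ifelse(is.na(A), -prod(dim(A)), 1)
f <- function(x) -c(x[seq(nr)] %*% Ainf %*% x[-seq(nr)])

set.seed(1)      # arbitrary seed, for reproducibility only
n_starts <- 20   # hypothetical choice; more starts cost more time

runs <- lapply(seq_len(n_starts), function(i) {
  st <- runif(sum(dim(A)))   # random start inside the box [0, 1]
  optim(st, f, lower = 0, upper = 1, method = "L-BFGS-B")
})

# Keep the run with the smallest objective value.
values <- vapply(runs, function(r) r$value, numeric(1))
best <- runs[[which.min(values)]]

setNames(round(best$par, 3), unlist(dimnames(A)))
best$value
```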

I have a solution, but it doesn't scale very well:

findBiggestSubmatrixNonContiguous <- function(A) {
    A <- !is.na(A); ## TRUE where a cell is non-NA; the actual values don't matter
    howmany <- expand.grid(nr=seq_len(nrow(A)),nc=seq_len(ncol(A)));
    howmany <- howmany[order(apply(howmany,1L,prod),decreasing=TRUE),];
    for (ri in seq_len(nrow(howmany))) {
        nr <- howmany$nr[ri];
        nc <- howmany$nc[ri];
        rcom <- combn(nrow(A),nr);
        ccom <- combn(ncol(A),nc);
        comcom <- expand.grid(ri=seq_len(ncol(rcom)),ci=seq_len(ncol(ccom)));
        for (comi in seq_len(nrow(comcom)))
            if (all(A[rcom[,comcom$ri[comi]],ccom[,comcom$ci[comi]]]))
                return(list(ri=rcom[,comcom$ri[comi]],ci=ccom[,comcom$ci[comi]]));
    }; ## end for
    NULL;
}; ## end findBiggestSubmatrixNonContiguous()

It's based on the idea that if the matrix has a small enough density of NAs, then by searching for the largest submatrices first, you'll be likely to find a solution fairly quickly.

The algorithm works by computing a cartesian product of all counts of rows and counts of columns that could be indexed out of the original matrix to produce the submatrix. The set of pairs of counts is then decreasingly ordered by the size of the submatrix that would be produced by each pair of counts; in other words, ordered by the product of the two counts. It then iterates over these pairs. For each pair, it computes all combinations of row indexes and column indexes that could be taken for that pair of counts, and tries each combination in turn until it finds a submatrix that contains zero NAs. Upon finding such a submatrix, it returns that set of row and column indexes as a list.

The result is guaranteed to be correct because it tries submatrix sizes in decreasing order, so the first one it finds must be the biggest (or tied for the biggest) possible submatrix that satisfies the condition.
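
The search order described above can be inspected on its own. The sketch below rebuilds the table of (row count, column count) pairs for a 6 x 5 matrix; the `cells` and `combos` columns are additions for illustration and are not part of the function.

```r
NR <- 6; NC <- 5   # dimensions of the example matrix

# Every possible pair of (number of rows, number of columns).
howmany <- expand.grid(nr = seq_len(NR), nc = seq_len(NC))

# Size of the submatrix each pair would produce.
howmany$cells <- howmany$nr * howmany$nc

# How many row/column index combinations exist for each pair.
howmany$combos <- choose(NR, howmany$nr) * choose(NC, howmany$nc)

# Largest submatrices first, as in the function.
howmany <- howmany[order(howmany$cells, decreasing = TRUE), ]
head(howmany)
```

The first pair is the full 6 x 5 matrix, and the function only reaches smaller pairs after every combination of the larger ones has failed.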


## OP's example matrix
A <- data.frame(C1=c(NA,NA,NA,NA,2L,NA),C2=c(1L,1L,1L,0L,NA,NA),C3=c(1L,8L,NA,1L,1L,NA),C4=c(NA,1L,1L,6L,1L,3L),C5=c(NA,1L,5L,1L,1L,NA),row.names=c('R1','R2','R3','R4','R5','R6'));
A;
##    C1 C2 C3 C4 C5
## R1 NA  1  1 NA NA
## R2 NA  1  8  1  1
## R3 NA  1 NA  1  5
## R4 NA  0  1  6  1
## R5  2 NA  1  1  1
## R6 NA NA NA  3 NA
system.time({ res <- findBiggestSubmatrixNonContiguous(A); });
##    user  system elapsed
##   0.094   0.000   0.100
res;
## $ri
## [1] 2 3 4
##
## $ci
## [1] 2 4 5
##
A[res$ri,res$ci];
##    C2 C4 C5
## R2  1  1  1
## R3  1  1  5
## R4  0  6  1

We see that the function works very quickly on the OP's example matrix, and returns a correct result.
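
A partial sanity check on a result like this is to confirm that the block has no NAs and that no further row or column could be added to it. This is a sketch of a necessary condition only: passing it does not prove that the block is the largest possible one. The indexes are hard-coded from the output above, and the variable names are chosen for this illustration.

```r
# Example matrix from the question.
A <- structure(c(NA, NA, NA, NA, 2L, NA,
                 1L, 1L, 1L, 0L, NA, NA,
                 1L, 8L, NA, 1L, 1L, NA,
                 NA, 1L, 1L, 6L, 1L, 3L,
                 NA, 1L, 5L, 1L, 1L, NA),
               .Dim = c(6L, 5L),
               .Dimnames = list(paste0("R", 1:6), paste0("C", 1:5)))

ri <- c(2, 3, 4); ci <- c(2, 4, 5)   # result printed above
ok <- !is.na(A)

# 1. The selected block contains no NA.
blockClean <- all(ok[ri, ci])

# 2. Is there an unused row that is NA-free across the selected columns?
extraRows <- setdiff(seq_len(nrow(A)), ri)
canAddRow <- any(apply(ok[extraRows, ci, drop = FALSE], 1, all))

# 3. Is there an unused column that is NA-free across the selected rows?
extraCols <- setdiff(seq_len(ncol(A)), ci)
canAddCol <- any(apply(ok[ri, extraCols, drop = FALSE], 2, all))

c(blockClean = blockClean, canAddRow = canAddRow, canAddCol = canAddCol)
# blockClean TRUE, canAddRow FALSE, canAddCol FALSE
```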


randTest <- function(NR,NC,probNA,seed=1L) {
    set.seed(seed);
    A <- replicate(NC,sample(c(NA,0:9),NR,prob=c(probNA,rep((1-probNA)/10,10L)),replace=TRUE));
    print(A);
    print(system.time({ res <- findBiggestSubmatrixNonContiguous(A); }));
    print(res);
    print(A[res$ri,res$ci,drop=FALSE]);
    invisible(res);
}; ## end randTest()

I wrote the above function to make testing easier. We can call it to test a random input matrix of size NR by NC, with a probability probNA of choosing NA in any given cell.


Here are a few trivial tests:

randTest(8L,1L,1/3);
##      [,1]
## [1,]   NA
## [2,]    1
## [3,]    4
## [4,]    9
## [5,]   NA
## [6,]    9
## [7,]    0
## [8,]    5
##    user  system elapsed
##   0.016   0.000   0.003
## $ri
## [1] 2 3 4 6 7 8
##
## $ci
## [1] 1
##
##      [,1]
## [1,]    1
## [2,]    4
## [3,]    9
## [4,]    9
## [5,]    0
## [6,]    5

randTest(11L,3L,4/5);
##       [,1] [,2] [,3]
##  [1,]   NA   NA   NA
##  [2,]   NA   NA   NA
##  [3,]   NA   NA   NA
##  [4,]    2   NA   NA
##  [5,]   NA   NA   NA
##  [6,]    5   NA   NA
##  [7,]    8    0    4
##  [8,]   NA   NA   NA
##  [9,]   NA   NA   NA
## [10,]   NA    7   NA
## [11,]   NA   NA   NA
##    user  system elapsed
##   0.297   0.000   0.300
## $ri
## [1] 4 6 7
##
## $ci
## [1] 1
##
##      [,1]
## [1,]    2
## [2,]    5
## [3,]    8

randTest(10L,10L,1/3);
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   NA   NA    0    3    8    3    9    1    6    NA
##  [2,]    1   NA   NA    4    5    8   NA    8    2    NA
##  [3,]    4    2    5    3    7    6    6    1    1     5
##  [4,]    9    1   NA   NA    4   NA   NA    1   NA     9
##  [5,]   NA    7   NA    8    3   NA    5    3    7     7
##  [6,]    9    3    1    2    7   NA   NA    9   NA     7
##  [7,]    0    2   NA    7   NA   NA    3    8    2     6
##  [8,]    5    0    1   NA    3    3    7    1   NA     6
##  [9,]    5    1    9    2    2    5   NA    7   NA     8
## [10,]   NA    7    1    6    2    6    9    0   NA     5
##    user  system elapsed
##   8.985   0.000   8.979
## $ri
## [1]  3  4  5  6  8  9 10
##
## $ci
## [1]  2  5  8 10
##
##      [,1] [,2] [,3] [,4]
## [1,]    2    7    1    5
## [2,]    1    4    1    9
## [3,]    7    3    3    7
## [4,]    3    7    9    7
## [5,]    0    3    1    6
## [6,]    1    2    7    8
## [7,]    7    2    0    5

I don't know an easy way of verifying whether the above result is correct, but it looks good to me. But it took almost 9 seconds to generate this result. Running the function on moderately larger matrices, especially a 77x132 matrix, is probably a lost cause.

Waiting to see if someone can come up with a brilliant efficient solution...
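
A rough way to see why this does not scale is to count the candidate submatrices. Every non-empty set of rows can be paired with every non-empty set of columns, so the worst case is (2^NR - 1) * (2^NC - 1) candidates. The function stops at the first hit, so this is only an upper bound on the work, and `candidates` is a hypothetical helper written for this estimate.

```r
# Worst-case number of (row subset, column subset) pairs to examine.
candidates <- function(NR, NC) (2^NR - 1) * (2^NC - 1)

candidates(6, 5)      # 63 * 31 = 1953
candidates(10, 10)    # 1023^2 = 1046529
candidates(77, 132)   # roughly 8e62

# Cross-check for 10 x 10: sum the combination counts over all size pairs.
sum(outer(choose(10, 1:10), choose(10, 1:10)))
```

On top of the count itself, the function materializes every combination with combn before testing any of them, which becomes infeasible long before 77x132.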
