Find max and min matrix marginal total variability

Is there a more elegant way to work out the maximum or minimum level of variability (CV) in the marginal column totals of a binary matrix based on its fill and size, given that all row and column totals must be non-zero? For example:

foo(n_col, n_row, fill){ get maximum possible CV }

Let's say we have a matrix called m where all column and row totals are > 0 but the matrix is minimally filled.

m <- matrix(rep(0,25), nrow = 5)
diag(m) <- 1
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    0    0    0    0
#[2,]    0    1    0    0    0
#[3,]    0    0    1    0    0
#[4,]    0    0    0    1    0
#[5,]    0    0    0    0    1

variability1 <- sd(colSums(m))/mean(colSums(m))
variability1
# [1] 0
# the maximum and minimum for this fill is zero 
# considering that  all column and row totals must be > 0

Perhaps we could check the maximum at increasing levels of fill like:

# find out which matrix elements are zeros (column-major order)
empty <- which(m < 1)
# vector for results
variability <- rep(NA_real_, length(empty))
# fill the empty cells one at a time, recording the CV after each step
for (i in seq_along(empty)) {
  m[empty[i]] <- 1
  variability[i] <- sd(colSums(m)) / mean(colSums(m))
}
# this should be the maximum CV for each given level of matrix fill...
c(variability1, variability)

I think filling the matrix column-wise like this maintains the maximum variability in the marginal column totals? Is there a simpler way to work out the maximum and minimum variability for matrices of different sizes, fills and shapes?

The following provides an alternative formulation of the problem as an optimization over the choice of the vector of column sums of a binary matrix that maximizes the variability for a given fill. Informal arguments for the validity of this formulation and the resulting algorithm to solve it are provided. The resulting algorithm is consistent with the OP's assertion:

filling the matrix column-wise like this maintains the maximum variability in the marginal column totals

First, define fill to be the number of 1's in the n_row by n_col binary matrix m. From the constraints of the problem statement that m is a binary matrix with all row and column sums greater than zero, fill is an integer in the range [max(n_row, n_col), n_row*n_col].

The problem is then: for a given value of fill in the range [max(n_row, n_col), n_row*n_col], find the maximum of

 sd(colSums(m))/mean(colSums(m))

over all m such that m is a binary matrix with fill number of 1's and with all row and column sums greater than zero.

We note that it is better to specify the domain of this optimization problem in terms of the vector of column sums of m rather than m itself. This is because there exist different m's with the same vector of column sums and therefore the same objective value. Denoting the vector of column sums as x, the above optimization problem can be restated as one of maximizing:

sd(x)/mean(x)

such that each element of x is an integer in the range [1, n_row] and sum(x) is fill.
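The restatement drops the row-sum constraint, so it is worth checking that an admissible x can actually be realized by some m whose row sums are also positive. The following is a minimal sketch, assuming a cyclic placement of the 1's over the rows; the helper name `realize_matrix` is hypothetical and not part of the original answer.

```r
# Hypothetical helper: build a binary matrix with column sums x and no
# empty row, by laying the 1's down cyclically over the rows.
realize_matrix <- function(x, n_row) {
  stopifnot(all(x >= 1), all(x <= n_row), sum(x) >= n_row)
  m <- matrix(0L, nrow = n_row, ncol = length(x))
  offset <- 0  # number of 1's placed so far
  for (j in seq_along(x)) {
    # rows for column j: the next x[j] positions, wrapping around
    rows <- (offset + seq_len(x[j]) - 1) %% n_row + 1
    m[rows, j] <- 1L
    offset <- offset + x[j]
  }
  m
}

m_check <- realize_matrix(c(5, 4, 1, 1, 1), n_row = 5)
colSums(m_check)           # 5 4 1 1 1
all(rowSums(m_check) > 0)  # TRUE
```

Because each column receives at most n_row consecutive positions, the rows within a column are distinct, and because at least n_row ones are placed in total, every row is visited at least once.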

Furthermore, since sum(x) is constrained to be equal to fill, the denominator term mean(x) is constant over all x for a given fill. Consequently, an equivalent objective function to maximize is simply sd(x), or equivalently the variance of x.
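For a small case, this equivalence can be checked by brute force. The sketch below enumerates every admissible vector of column sums for an assumed 3 by 3 matrix with fill 6; the sizes and variable names are chosen only for illustration.

```r
# Enumerate every admissible vector of column sums for a small case
n_col <- 3; n_row <- 3; fill <- 6
grid <- as.matrix(expand.grid(rep(list(seq_len(n_row)), n_col)))
feasible <- grid[rowSums(grid) == fill, , drop = FALSE]

cv <- apply(feasible, 1, function(x) sd(x) / mean(x))
v  <- apply(feasible, 1, var)

# mean(x) is fill / n_col for every feasible x, so CV is a fixed
# multiple of sd(x) and the two objectives rank the vectors identically
all.equal(cv, sqrt(v) / (fill / n_col))  # TRUE
max(cv)                                  # 0.5, attained by permutations of (3, 2, 1)
```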

To maximize the variance of x, we need to choose x such that the differences between its values are maximized while still satisfying the constraints on x. Here, we can think about this problem inductively with respect to fill. Let's assume that for a given fill, we have the solution for x that maximizes the variance of x while satisfying its constraints. The question becomes: when we increment fill to fill + 1, what is the new x that maximizes its variance? Because we have the constraint that sum(x) = fill and each element in x is an integer, incrementing fill implies that we must increment one and only one element of x. For the moment, relax the upper limit constraint on each element in x (i.e., x[i] <= n_row for all i in [1, n_col]). Then the question becomes: which element in x should be incremented to maximize the increase in the variance of x? For the answer to this question, we can look at the Taylor series expansion of var(x):

var(x + dx) = var(x) + gradient(var(x)) %*% dx + 1/2 * t(dx) %*% Hessian(var(x)) %*% dx

where dx is a vector of length n_col with one element equal to 1 and all other elements 0 (i.e., an indicator vector). Since var(x) is quadratic in x, a second order expansion is sufficient. Furthermore, since dx is an indicator vector, only the diagonal elements of the Hessian matrix matter. These are given by:

gradient(var(x))[i] = 2*(x[i]-mean(x))/(n_col-1),      for all i in [1,n_col]
Hessian(var(x))[i,i] = 2/n_col                  ,      for all i in [1,n_col]

Since all the diagonal terms of the Hessian are the same, the second order term of the Taylor series is the same for any choice of dx. Consequently, only the first order term matters in determining which element in x to increment to maximize the increase in the variance of x. From the gradient terms, it is clear that we should choose to increment the i-th element in x that has the largest current value x[i] in order to maximize the increase in the variance of x. Now, we reintroduce the upper limit constraint on each element of x. Then, the optimal choice is to increment the i-th element in x that has the largest current value x[i] < n_row. Note that if there are multiple such elements in x that have the same maximum value x[i] < n_row, then choosing any one of these will result in the same maximal increase in the variance of x.
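The expansion can be verified numerically. In this sketch the vector x is an arbitrary example; the exact change in variance from incrementing each element is compared with the first order term plus half the diagonal Hessian term, which is (2/n_col)/2 = 1/n_col.

```r
x <- c(2, 1, 1, 1, 1)   # example vector of column sums
n_col <- length(x)

# exact change in variance when element i is incremented by 1
exact <- sapply(seq_len(n_col), function(i) {
  dx <- replace(numeric(n_col), i, 1)
  var(x + dx) - var(x)
})

# gradient term plus half the diagonal Hessian term
predicted <- 2 * (x - mean(x)) / (n_col - 1) + 1 / n_col

all.equal(exact, predicted)  # TRUE
which.max(exact)             # 1, the element with the largest current value
```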

What we have shown so far is that given a fill and the solution for x that maximizes the variance of x while satisfying its constraints, we have a rule dx that maximizes the incremental increase in the variance of x for fill + 1. It remains to show that this rule results in a new x that is the optimal x, that is, the one that maximizes the variance for the new fill + 1. We now show this by contradiction. Specifically, if this new x does not maximize the variance of x for fill + 1, then there must exist another vector of column sums x_1 for fill and a different rule dx_1 such that

var(x_1 + dx_1) > var(x + dx)

However, since x maximizes var(x) for fill and the equations for the gradient and Hessian hold for any x, we have:

var(x_1 + dx_1) = var(x_1) + gradient(var(x_1)) %*% dx_1 + 1/2 * t(dx_1) %*% Hessian(var(x_1)) %*% dx_1
                <= var(x_1) + 2*(max(x_1)-mean(x_1))/(n_col-1) + constant
                <= var(x) + 2*(max(x)-mean(x))/(n_col-1) + constant
                = var(x + dx)

and hence the contradiction. To explain the steps more clearly:

  1. From step 1 to step 2, the best choice for dx_1 at x_1 is the one that increments the maximum element in x_1, and hence gradient(var(x_1)) %*% dx_1 <= 2*(max(x_1)-mean(x_1))/(n_col-1). Also, the second order term is the same for all x and dx (it equals 1/n_col), so we simply write it as constant.
  2. From step 2 to step 3, we have (i) var(x_1) <= var(x) by our assumption that x maximizes the variance at fill, (ii) gradient(var(x)) %*% dx = 2*(max(x)-mean(x))/(n_col-1) for the optimal rule dx at fill, and (iii) max(x_1) <= max(x) given that x maximizes the variance at fill. To see the latter, consider a common predecessor vector of column sums x_-1 for fill-1. The difference in rules between incrementing x_-1 to x and to x_1 is simply a different choice of the element in x_-1 to increment. From the Taylor series expansion of the variance at x_-1, it is clear that the element of x_-1 incremented to reach x must have a value greater than or equal to that of the element incremented to reach x_1, because var(x_1) <= var(x). Therefore, max(x_1) <= max(x). Now extend this reasoning to any common vector of column sums at some previous fill-k >= max(n_row, n_col), including the initial fill where the initial vector of column sums is all 1's. Then, for one path choose the optimal incremental rules as defined above to reach x, while for the other path choose an arbitrary sequence of incremental rules to reach x_1. Since the optimal rules always increment the largest element of the state at each step (subject to the upper limit), it is clear that once again max(x_1) <= max(x).

Finally, to complete the mathematical induction, we start with the initial fill where x is all 1's, that is, sum(x) = n_col. (When n_row > n_col this sum is below the smallest fill that a valid m allows, but the induction over x is unaffected.) This trivially maximizes var(x) since there are no other choices for x given this initial fill. Now, the optimal incremental rule dx is to choose the first element of x to increment, since all elements are equal. The resulting x + dx trivially maximizes the variance for the initial fill plus one, since incrementing any other element of x results in the same variance.
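The incremental rule from the induction can also be simulated directly. This is only a sketch of the argument, not the original author's code, and the function name `greedy_colsums` is hypothetical: it starts from all 1's and repeatedly increments the largest element that is still below n_row.

```r
# Hypothetical helper: apply the incremental rule (fill - n_col) times
greedy_colsums <- function(n_col, n_row, fill) {
  stopifnot(fill >= n_col, fill <= n_row * n_col)
  x <- rep(1, n_col)
  for (k in seq_len(fill - n_col)) {
    open <- which(x < n_row)       # elements that may still grow
    i <- open[which.max(x[open])]  # first of the largest among them
    x[i] <- x[i] + 1
  }
  x
}

x_greedy <- greedy_colsums(5, 5, 12)
x_greedy                       # 5 4 1 1 1
sd(x_greedy) / mean(x_greedy)  # about 0.8122
```

The first column is filled to n_row before the second starts to grow, which is the column-wise filling pattern described in the question.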

The above arguments immediately suggest the following algorithm to distribute a value of fill across the vector of column sums:

  1. Loop through each element in the vector of column sums x.
  2. For each i-th element, set x[i] <- min(n_row, fill - (n_col-i)). Note that we subtract (n_col-i) from fill so that we reserve enough to give each of the remaining elements of the column sums vector at least 1, and we cap the amount at n_row to satisfy the constraints of the problem.
  3. Update fill <- fill - x[i].

This algorithm and the associated arguments validate the OP's assertion that

filling the matrix column-wise like this maintains the maximum variability in the marginal column totals

In R, the code looks like:

foo <- function(n_col, n_row, fill) {
  ## preallocate the vector of column sums x and initialize to NA
  x <- rep(NA, n_col)
  for (i in seq_len(n_col)) {
    x[i] <- pmin.int(n_row, fill-(n_col-i))
    fill <- fill - x[i]
  }
  ## compute the variability given the vector of column sums x
  sd(x)/mean(x)
}

Recognizing that the repeated decrementing of fill in a loop can be replaced by a cumsum, the above simplifies to:

foo <- function(n_col, n_row, fill) {
  x <- pmin.int(pmax.int(cumsum(c(fill-n_col+1,rep(-n_row+1,n_col-1))),1),n_row)
  ## compute the variability given the vector of column sums x
  sd(x)/mean(x)
}

Using this function, we recover the OP's result:

n_col=5
n_row=5
variability <- sapply(max(n_col,n_row):(n_col*n_row), function(fill) foo(n_col, n_row, fill))
print(variability)
## [1] 0.0000000 0.3726780 0.6388766 0.8385255 0.9938080 0.8660254 0.8131156 0.8122329 0.8426501
##[10] 0.7319251 0.6666667 0.6404344 0.6443795 0.5414886 0.4707512 0.4330127 0.4259177 0.3049184
##[19] 0.1944407 0.0931695 0.0000000
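The question also asks for the minimum, which the argument above does not cover. A sketch under the assumption that the same reduction to column sums applies: for a fixed sum, the variance of an integer vector is smallest when its entries are as equal as possible, so the fill is spread evenly across the columns. The name `foo_min` is hypothetical and not part of the original answer.

```r
# Hypothetical counterpart to foo(): minimum CV for a given fill
foo_min <- function(n_col, n_row, fill) {
  stopifnot(fill >= max(n_row, n_col), fill <= n_row * n_col)
  base  <- fill %/% n_col   # every column gets at least this many 1's
  extra <- fill %% n_col    # this many columns get one more
  x <- base + as.integer(seq_len(n_col) <= extra)
  sd(x) / mean(x)
}

foo_min(5, 5, 5)   # 0, column sums 1 1 1 1 1
foo_min(5, 5, 12)  # about 0.228, column sums 3 3 2 2 2
```

The minimum is zero exactly when fill is a multiple of n_col, since the column sums can then all be equal.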
