Find max and min matrix marginal total variability
Is there a more elegant way to work out the maximum or minimum level of variability (CV) in the marginal column totals of a binary matrix based on its fill and size, given that all row and column totals must be non-zero? For example:
foo(n_col, n_row, fill){ get maximum possible CV }
Let's say we have a matrix called m where all column and row totals are > 0 but the matrix is minimally filled.
m <- matrix(rep(0,25), nrow = 5)
diag(m) <- 1
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 0 0 0 0
#[2,] 0 1 0 0 0
#[3,] 0 0 1 0 0
#[4,] 0 0 0 1 0
#[5,] 0 0 0 0 1
variability1 <- sd(colSums(m))/mean(colSums(m))
variability1
# [1] 0
# the maximum and minimum for this fill is zero
# considering that all column and row totals must be > 0
Perhaps we could check the maximum at increasing levels of fill like:
# find out which matrix elements are zeros
empty <- which(m < 1)
# vector for results
variability <- rep(NA, length(empty))
#
for(i in 1:length(variability)){
m[empty[[i]] ] <- 1
variability[[i]] <- sd(colSums(m))/mean(colSums(m))
}
# we get what should be the maximum CV for each given level of matrix fill...
c(variability1, variability)
I think filling the matrix column-wise like this maintains the maximum variability in the marginal column totals? Is there a simpler way to work this out for the maximum and minimum variability of matrices of different sizes, fills and shapes?
The following provides an alternative formulation of the problem as an optimization over the choice of the vector of column sums of a binary matrix that maximizes the variability for a given fill. Informal arguments for the validity of this formulation and for the resulting algorithm to solve it are provided. The resulting algorithm is consistent with the OP's assertion:
filling the matrix column-wise like this maintains the maximum variability in the marginal column totals
First, define fill to be the number of 1's in the n_row by n_col binary matrix m. From the constraints of the problem statement that m is a binary matrix with all row and column sums greater than zero, fill is an integer in the range [max(n_row, n_col), n_row*n_col].
The problem is then: for a given value of fill in the range [max(n_row, n_col), n_row*n_col], find the maximum of
sd(colSums(m))/mean(colSums(m))
over all m such that m is a binary matrix with fill number of 1's and with all row and column sums greater than zero.
We note that it is better to specify the domain of this optimization problem in terms of the vector of column sums of m rather than m itself. This is because there exist different m's with the same vector of column sums and therefore the same objective value. Denoting the vector of column sums as x, the above optimization problem can be restated as one of maximizing:
sd(x)/mean(x)
such that each element of x is an integer in the range [1, n_row] and sum(x) is fill.
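The restatement relies on every admissible x being the column-sum vector of some valid m. The following is a minimal sketch of one way to see this, assuming sum(x) >= n_row (which holds because fill >= max(n_row, n_col)). The helper name matrix_from_colsums is hypothetical and not part of the original answer. It places each column's 1's in consecutive rows, wrapping around, so that every row ends up covered:

```r
# Hypothetical helper: build a binary matrix whose column sums equal x.
# Assumes 1 <= x[j] <= n_row for all j and sum(x) >= n_row.
matrix_from_colsums <- function(x, n_row) {
  stopifnot(all(x >= 1), all(x <= n_row), sum(x) >= n_row)
  m <- matrix(0, nrow = n_row, ncol = length(x))
  start <- 0  # number of 1's placed so far
  for (j in seq_along(x)) {
    # rows start+1, ..., start+x[j], wrapped into 1..n_row
    rows <- ((start + seq_len(x[j]) - 1) %% n_row) + 1
    m[rows, j] <- 1
    start <- start + x[j]
  }
  m
}

m <- matrix_from_colsums(c(5, 2, 1, 1, 1), n_row = 5)
colSums(m)           # the requested column sums
all(rowSums(m) > 0)  # every row is non-empty
```

Because a column never receives more than n_row consecutive positions, its rows are distinct, and because at least n_row positions are handed out in total, no row is left empty.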
Furthermore, since sum(x) is constrained to be equal to fill, the denominator term mean(x) is constant over all x for a given fill. Consequently, an equivalent objective function to maximize is simply sd(x) or, equivalently, the variance of x.
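As a side note on the minimum asked about in the question: with mean(x) fixed, the CV is smallest when the column sums are as equal as possible, so the fill can simply be spread evenly across the columns. The following is a minimal sketch under that assumption. The name min_cv is hypothetical and not part of the original answer:

```r
# Hypothetical helper: smallest CV of the column sums for a given fill.
# Assumes max(n_row, n_col) <= fill <= n_row * n_col.
min_cv <- function(n_col, n_row, fill) {
  stopifnot(fill >= max(n_row, n_col), fill <= n_row * n_col)
  q <- fill %/% n_col  # every column gets at least q ones
  r <- fill %% n_col   # r columns get one extra
  x <- rep(q, n_col) + rep(c(1, 0), times = c(r, n_col - r))
  sd(x) / mean(x)
}

min_cv(5, 5, 10)  # fill divides evenly, so all column sums are equal
min_cv(5, 5, 7)   # column sums 2, 2, 1, 1, 1
```

The balanced vector stays within [1, n_row] for every fill in the admissible range, so it satisfies the same constraints as the maximizing vector.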
To maximize the variance of x, we need to choose x such that the differences between its values are maximized while still satisfying the constraints on x. Here, we can think about this problem inductively with respect to fill. Let's assume that for a given fill, we have the solution for x that maximizes the variance of x while satisfying its constraints. The question becomes: when we increment fill to fill + 1, what is the new x that maximizes its variance? Because we have the constraint that sum(x) = fill and each element in x is an integer, incrementing fill implies that we must increment one and only one element of x. For the moment, relax the upper limit constraint on each element in x (i.e., x[i] <= n_row for all i in [1, n_col]); then the question becomes: which element in x should we increment to maximize the increase in the variance of x? For the answer to this question, we can look at the Taylor series expansion of var(x):
var(x + dx) = var(x) + gradient(var(x)) %*% dx + 1/2 * t(dx) %*% Hessian(var(x)) %*% dx
where dx is a vector of length n_col with one element equal to 1 and all other elements 0 (i.e., an indicator vector). Since var(x) is quadratic in x, a second order expansion is exact. Furthermore, since dx is an indicator vector, only the diagonal elements of the Hessian matrix matter. The gradient and the diagonal of the Hessian are given by:
gradient(var(x))[i] = 2*(x[i]-mean(x))/(n_col-1), for all i in [1,n_col]
Hessian(var(x))[i,i] = 2/n_col , for all i in [1,n_col]
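These expressions can be checked numerically. The following sketch uses an arbitrary example vector (an assumption for illustration only) and compares the actual change in var(x) from incrementing each element with the value predicted by the gradient and Hessian terms:

```r
# Check: var(x + e_i) - var(x) should equal
# gradient[i] + 0.5 * Hessian[i, i] = 2 * (x[i] - mean(x)) / (n - 1) + 1 / n
x <- c(3, 1, 4, 1, 2)  # arbitrary example vector
n <- length(x)

actual <- sapply(seq_len(n), function(i) {
  dx <- replace(numeric(n), i, 1)  # indicator vector for element i
  var(x + dx) - var(x)
})
predicted <- 2 * (x - mean(x)) / (n - 1) + 1 / n

all.equal(actual, predicted)
which.max(actual) == which.max(x)  # the largest element gives the largest increase
```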
Since all the diagonal terms of the Hessian are the same, the second order term of the Taylor series is the same for any choice of dx. Consequently, only the first order term matters in determining which element in x to increment to maximize the increase in the variance of x. From the gradient terms, it is clear that we should choose to increment the i-th element in x that has the largest current value x[i] in order to maximize the increase in the variance of x. Now, we reintroduce the upper limit constraint on each element of x. Then, the optimal choice is to increment the i-th element in x that has the largest current value x[i] < n_row. Note that if there are multiple such elements in x that have the same maximum value x[i] < n_row, then choosing any one of these will result in the same maximal increase in the variance of x.
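The increment rule can be applied literally, one unit of fill at a time. The following is a minimal sketch, starting from all column sums equal to 1. The name greedy_colsums is hypothetical and not part of the original answer:

```r
# Hypothetical helper: apply the increment rule one unit of fill at a time.
# Starts from all column sums equal to 1 (i.e., fill = n_col).
greedy_colsums <- function(n_col, n_row, fill) {
  stopifnot(fill >= n_col, fill <= n_row * n_col)
  x <- rep(1, n_col)
  while (sum(x) < fill) {
    candidates <- which(x < n_row)             # elements still below the cap
    i <- candidates[which.max(x[candidates])]  # largest of those (first on ties)
    x[i] <- x[i] + 1
  }
  x
}

greedy_colsums(5, 5, 10)  # column sums 5, 2, 1, 1, 1
```

Breaking ties by taking the first candidate fills the columns from left to right, which matches the column-wise filling in the question.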
What we have shown so far is that given a fill and the solution for x that maximizes the variance of x while satisfying its constraints, we have a rule dx that maximizes the incremental increase in the variance of x for fill + 1. It remains to show that this rule results in a new x that is the optimal x that maximizes the variance of x for the new fill + 1. We now show this by contradiction. Specifically, if this new x does not maximize the variance of x for fill + 1, then there must exist another vector of column sums x_1 for fill and a different rule dx_1 such that
var(x_1 + dx_1) > var(x + dx)
However, since x maximizes var(x) for fill and the equations for the gradient and Hessian hold for any x, we have:
var(x_1 + dx_1) = var(x_1) + gradient(var(x_1)) %*% dx_1 + 1/2 * t(dx_1) %*% Hessian(var(x_1)) %*% dx_1
<= var(x_1) + 2*(max(x_1)-mean(x_1))/(n_col-1) + constant
<= var(x) + 2*(max(x)-mean(x))/(n_col-1) + constant
= var(x + dx)
and hence the contradiction. To explain the steps more clearly:
The first inequality holds because the optimal rule dx_1 at x_1 is the one that increments the maximum element in x_1, and hence gradient(var(x_1)) %*% dx_1 <= 2*(max(x_1)-mean(x_1))/(n_col-1). Also, the second order term is a constant for all x and dx for a given fill, so we simply state that as a constant.
The second inequality holds because (i) var(x_1) <= var(x) by our assumption that x maximizes the variance at fill, (ii) gradient(var(x)) %*% dx = 2*(max(x)-mean(x))/(n_col-1) for the optimal rule dx at fill, and (iii) max(x_1) <= max(x) given that x maximizes the variance at fill. To see the latter, consider a common predecessor vector of column sums x_-1 for fill-1. The difference in rules between incrementing x_-1 to x and x_1 is simply a different choice of the element in x_-1 to increment. From the Taylor series expansion of the variance at x_-1, it is clear that the element in x_-1 incremented to go to x must be greater than or equal to the one incremented to go to x_1, because var(x_1) <= var(x). Therefore, max(x_1) <= max(x). Now extend this reasoning to any common vector of column sums at some previous fill-k >= max(n_row, n_col), including the initial fill where the initial vector of column sums is all 1's. Then, for one path choose the optimal incremental rules as defined above to reach x, while for the other path choose an arbitrary path of incremental rules to reach x_1. Since the optimal rules always increment the largest element of the state at each step (subject to the upper limit), it is clear that once again max(x_1) <= max(x).
Finally, to complete the mathematical induction, we start with the initial fill where x is all 1's. This trivially optimizes var(x) since there are no other choices for x given this initial fill. Now, the optimal incremental rule dx is to choose the first element of x to increment, since all elements are equal. The resulting x + dx trivially maximizes the variance for the initial fill plus one, since incrementing any other element of x will result in the same variance.
The above arguments immediately suggest the following algorithm to distribute a value of fill across the vector of column sums x. For each element in the vector x, taking i from 1 to n_col:
1. Set the i-th element x[i] <- min(n_row, fill - (n_col-i)). Note that we subtract (n_col-i) from fill so that we can reserve these to fill the rest of the elements of the column sums vector with at least 1, and we limit the amount to n_row to satisfy the constraints of the problem.
2. Decrement fill <- fill - x[i].
This algorithm and the associated arguments validate the OP's assertion that
filling the matrix column-wise like this maintains the maximum variability in the marginal column totals
In R, the code looks like:
foo <- function(n_col, n_row, fill) {
## preallocate the vector of column sums x and initialize to NA
x <- rep(NA, n_col)
for (i in seq_len(n_col)) {
x[i] <- pmin.int(n_row, fill-(n_col-i))
fill <- fill - x[i]
}
## compute the variability given the vector of column sums x
sd(x)/mean(x)
}
Recognizing that the repeated decrementing of fill in a loop can be replaced by a cumsum, the above simplifies to:
foo <- function(n_col, n_row, fill) {
x <- pmin.int(pmax.int(cumsum(c(fill-n_col+1,rep(-n_row+1,n_col-1))),1),n_row)
## compute the variability given the vector of column sums x
sd(x)/mean(x)
}
Using this function, we recover the OP's result:
n_col=5
n_row=5
variability <- sapply(max(n_col,n_row):(n_col*n_row), function(fill) foo(n_col, n_row, fill))
print(variability)
## [1] 0.0000000 0.3726780 0.6388766 0.8385255 0.9938080 0.8660254 0.8131156 0.8122329 0.8426501
##[10] 0.7319251 0.6666667 0.6404344 0.6443795 0.5414886 0.4707512 0.4330127 0.4259177 0.3049184
##[19] 0.1944407 0.0931695 0.0000000
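For small sizes, the result can also be checked against an exhaustive search over all admissible column-sum vectors. The following is a sketch only: foo_loop restates the loop version of foo so that the block runs on its own, and brute_force_max_cv is a hypothetical helper that is not part of the original answer. Note that it enumerates column-sum vectors x rather than matrices m, so it checks the optimization over x and not the equivalence of the two formulations:

```r
# Restatement of the loop version of foo, so the block runs on its own.
foo_loop <- function(n_col, n_row, fill) {
  x <- numeric(n_col)
  for (i in seq_len(n_col)) {
    x[i] <- min(n_row, fill - (n_col - i))
    fill <- fill - x[i]
  }
  sd(x) / mean(x)
}

# Hypothetical helper: maximum CV by enumerating every column-sum vector
# with entries in 1..n_row that sums to fill.
brute_force_max_cv <- function(n_col, n_row, fill) {
  grid <- as.matrix(expand.grid(rep(list(seq_len(n_row)), n_col)))
  grid <- grid[rowSums(grid) == fill, , drop = FALSE]
  max(apply(grid, 1, function(x) sd(x) / mean(x)))
}

n_col <- 4
n_row <- 3
fills <- max(n_col, n_row):(n_col * n_row)
greedy <- sapply(fills, function(f) foo_loop(n_col, n_row, f))
brute  <- sapply(fills, function(f) brute_force_max_cv(n_col, n_row, f))
all.equal(greedy, brute)
```

The enumeration grows as n_row^n_col, so this is only practical as a sanity check on small matrices.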