constrained linear regression / quadratic programming python
I have a dataset like this:
import numpy as np
a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])
m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])
and want to fit a constrained linear regression
y = b1*a + b2*b + b3*c
where all b's are positive and sum to one: b1 + b2 + b3 = 1
A similar problem in R is specified here:
https://stats.stackexchange.com/questions/21565/how-do-i-fit-a-constrained-regression-in-r-so-that-coefficients-total-1
How can I do this in Python?
EDIT: These two approaches are very general and can work for small- to medium-scale instances. For a more efficient approach, check the answer of chthonicdaemon (using customized preprocessing and scipy's optimize.nnls).
import numpy as np
from scipy.optimize import minimize
a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])
m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])
def loss(x):
    return np.sum(np.square(np.dot(x, m) - y))

cons = ({'type': 'eq',
         'fun': lambda x: np.sum(x) - 1.0})
x0 = np.zeros(m.shape[0])
res = minimize(loss, x0, method='SLSQP', constraints=cons,
               bounds=[(0, np.inf) for i in range(m.shape[0])],
               options={'disp': True})
print(res.x)
print(np.dot(res.x, m))  # fitted values b1*a + b2*b + b3*c
print(np.sum(np.square(np.dot(res.x, m) - y)))
Optimization terminated successfully. (Exit mode 0)
Current function value: 18.817792344
Iterations: 5
Function evaluations: 26
Gradient evaluations: 5
[ 0.7760881 0. 0.2239119]
[ 2.14042999  3.17325642  3.68500262]  # fitted values, recomputed to match the corrected print above
18.817792344
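As a quick sanity check (an added sketch, not part of the original answer), you can verify that the constraints are satisfied at the returned solution:

print(res.x.sum())           # equality constraint: should be ~1.0
print((res.x >= 0).all())    # bound constraint: all coefficients non-negative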
An alternative is to formulate the problem with cvxpy. Note that this snippet uses the old pre-1.0 cvxpy API (sum_entries, diag, and * for matrix products); a version updated to the current API appears in a later answer below:
import numpy as np
from cvxpy import *
a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])
m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])
X = Variable(m.shape[0])
constraints = [X >= 0, sum_entries(X) == 1.0]
product = m.T * diag(X)
diff = sum_entries(product, axis=1) - y
problem = Problem(Minimize(norm(diff)), constraints)
problem.solve(verbose=True)
print(problem.value)
print(X.value)
ECOS 2.0.4 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS
It pcost dcost gap pres dres k/t mu step sigma IR | BT
0 +0.000e+00 -0.000e+00 +2e+01 5e-01 1e-01 1e+00 4e+00 --- --- 1 1 - | - -
1 +2.451e+00 +2.539e+00 +4e+00 1e-01 2e-02 2e-01 8e-01 0.8419 4e-02 2 2 2 | 0 0
2 +4.301e+00 +4.306e+00 +2e-01 5e-03 7e-04 1e-02 4e-02 0.9619 1e-02 2 2 2 | 0 0
3 +4.333e+00 +4.334e+00 +2e-02 4e-04 6e-05 1e-03 4e-03 0.9326 2e-02 2 1 2 | 0 0
4 +4.338e+00 +4.338e+00 +5e-04 1e-05 2e-06 4e-05 1e-04 0.9698 1e-04 2 1 1 | 0 0
5 +4.338e+00 +4.338e+00 +3e-05 8e-07 1e-07 3e-06 7e-06 0.9402 7e-03 2 1 1 | 0 0
6 +4.338e+00 +4.338e+00 +7e-07 2e-08 2e-09 6e-08 2e-07 0.9796 1e-03 2 1 1 | 0 0
7 +4.338e+00 +4.338e+00 +1e-07 3e-09 4e-10 1e-08 3e-08 0.8458 2e-02 2 1 1 | 0 0
8 +4.338e+00 +4.338e+00 +7e-09 2e-10 2e-11 9e-10 2e-09 0.9839 5e-02 1 1 1 | 0 0
OPTIMAL (within feastol=1.7e-10, reltol=1.5e-09, abstol=6.5e-09).
Runtime: 0.000555 seconds.
4.337947939 # needs to be squared to be compared to scipy's output!
# as we are using l2-norm (outer sqrt) instead of sum-of-squares
# which is nicely converted to SOCP-form and easier to
# tackle by SOCP-based solvers like ECOS
# -> does not change the solution-vector x, only the obj-value
[[ 7.76094262e-01]
[ 7.39698388e-10]
[ 2.23905737e-01]]
You can get a good solution to this with a little bit of math and scipy.optimize.nnls:
First we do the math:
If
y = b1*a + b2*b + b3*c and b1 + b2 + b3 = 1, then b3 = 1 - b1 - b2.
If we substitute and simplify we end up with
y - c = b1*(a - c) + b2*(b - c)
Now we don't have any equality constraints, and nnls can solve this directly:
import numpy as np
import scipy.optimize

# a, b, c and y are the arrays defined in the question
A = np.vstack([a - c, b - c]).T
(b1, b2), norm = scipy.optimize.nnls(A, y - c)
b3 = 1 - b1 - b2
This recovers the same solution as obtained in the other answer using cvxpy.
b1 = 0.77608809648662802
b2 = 0.0
b3 = 0.22391190351337198
norm = 4.337947941595865
This approach can be generalised to an arbitrary number of dimensions as follows. Assume that we have a matrix B constructed with a, b, c from the original question arranged in its columns. Any additional dimensions will get added to this.
Now, we can do
A = B[:, :-1] - B[:, -1:]                        # subtract the last column from the others
bb, norm = scipy.optimize.nnls(A, y - B[:, -1])  # solve for all but the last coefficient
bi = np.append(bb, 1 - sum(bb))                  # recover the last coefficient from the sum constraint
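As a usage sketch (an addition, assuming the arrays from the question), the generalised version reproduces the coefficients above. Note that nnls only enforces non-negativity on the first n-1 coefficients, so the recovered last one is worth checking separately:

import numpy as np
import scipy.optimize

a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])
y = np.array([5.3, 0.9, 5.6])

B = np.vstack([a, b, c]).T                       # variables arranged in columns
A = B[:, :-1] - B[:, -1:]
bb, norm = scipy.optimize.nnls(A, y - B[:, -1])
bi = np.append(bb, 1 - sum(bb))

print(bi)            # ~[0.7760881, 0.0, 0.2239119]
assert bi[-1] >= 0   # the substituted-out coefficient is not constrained by nnls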
One comment regarding sascha's scipy implementation: be aware that with scipy's minimize, the trial-and-error nature of SLSQP may get you a solution that is slightly off unless you make some other specifications, namely the maximum number of iterations (maxiter) and the tolerance (ftol), as detailed in the scipy docs here.
The default values are maxiter=100 and ftol=1e-06.
Here is an example to illustrate using matrix notation: first get rid of the constraints and bounds. Also assume for simplicity that the intercept is 0. In that case, the coefficients for any multiple regression, as covered here on page 4, will be (precisely):
def betas(y, x):
    # y and x are ndarrays -- the response vector and the design matrix
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))
Now, given that the goal of least squares regression is to minimize the sum of squared residuals, take sascha's loss function (re-written slightly):
def resids(b, y, x):
    resid = y - np.dot(x, b)
    return np.dot(resid.T, resid)
Given your actual y and x vectors, you can plug the resulting "true" betas from the first equation above into the second to get a much better "benchmark". Compare this benchmark to the .fun attribute of res (what scipy's minimize spits out). Even tiny changes can cause meaningful changes to the resulting coefficients.
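As an illustrative sketch (assuming m, y, and res from sascha's snippet above), the comparison could look like this; the unconstrained optimum is a lower bound on any constrained objective value:

X = m.T                          # design matrix with a, b, c as columns
b_true = betas(y, X)             # closed-form unconstrained OLS coefficients
benchmark = resids(b_true, y, X)

print(benchmark)                 # unconstrained minimum of the loss
print(res.fun)                   # constrained SLSQP objective; always >= benchmark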
So to make a long story short, it will sacrifice speed but improve accuracy to use something like
options={'maxiter': 1000, 'ftol': 1e-07}
within sascha's code.
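For concreteness, a minimal sketch of how those options slot into sascha's earlier minimize call (same loss, x0, cons, and bounds):

res = minimize(loss, x0, method='SLSQP', constraints=cons,
               bounds=[(0, np.inf) for i in range(m.shape[0])],
               options={'maxiter': 1000, 'ftol': 1e-07, 'disp': True})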
Your problem is a linear least squares problem; you can solve it directly with a quadratic programming solver using the solve_ls function in qpsolvers. Here is a snippet adapted from this post on linear regression in Python:
import numpy as np
from qpsolvers import solve_ls

# m and y as defined in the question
# Objective (|| R x - s ||^2): || [a b c] x - y ||^2
R = m.T
s = y

# Linear constraint (A * x == b): sum(x) = 1
A = np.ones((1, 3))
b = np.array([1.0])

# Box constraint (lb <= x): x >= 0
lb = np.zeros(3)

x = solve_ls(R, s, A=A, b=b, lb=lb, solver="quadprog")
On my machine this code finds the solution x = array([0.7760881, 0.0, 0.2239119]). I've uploaded the full code to constrained_linear_regression.py, feel free to try it out.
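As a side note (an added sketch, not from the original answer), the same least squares objective can also be handed to qpsolvers' generic solve_qp entry point by expanding it into quadratic form, which is the quadratic programming view mentioned in the question's title:

from qpsolvers import solve_qp

# expand ||R x - s||^2 = x^T (R^T R) x - 2 (R^T s)^T x + const,
# so minimize (1/2) x^T P x + q^T x with P = R^T R and q = -R^T s
P = R.T @ R
q = -R.T @ s
x_qp = solve_qp(P, q, A=A, b=b, lb=lb, solver="quadprog")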
Thanks @sascha for the great answer! The cvxpy example there is pretty out of date, so I thought I would provide a slightly different version based on their current API, with some light editing for clarity:
import numpy as np
import cvxpy as cp
x1 = np.arange(1000)
x2 = np.random.normal(size=(1000,))
x = np.vstack([x1, x2]).T
y = np.random.random_sample(size=(1000,))
print("x shape", x.shape)
print("y shape", y.shape)
weights = cp.Variable(x.shape[1])
objective = cp.sum_squares(x @ weights - y)
minimize = cp.Minimize(objective)
constraints = [weights >= 0, cp.sum(weights) == 1.0]
problem = cp.Problem(minimize, constraints)
problem.solve(verbose=True)
print(problem.value)
print(weights.value)
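A quick sanity check (an added sketch, not part of the original answer) confirms the constraints hold at the solution, up to solver tolerance:

print(weights.value.sum())             # equality constraint: ~1.0
print((weights.value >= -1e-9).all())  # non-negativity up to tolerance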
I'll add that cvxpy is actively managed, and the team there seems very responsive.