
Multiple Linear Regression with specific constraints on each coefficient in Python

I am currently running a multiple linear regression on a dataset. At first, I didn't realize I needed to put constraints on my weights; as a matter of fact, I need to have specific positive and negative weights.

To be more precise, I am building a scoring system, which is why some of my variables should have a positive or negative impact on the score. Yet, when running my model, the results do not fit what I am expecting: some of my 'positive' variables get negative coefficients and vice versa.

As an example, let's suppose my model is:

y = W0*x0 + W1*x1 + W2*x2 

Here x2 is a 'positive' variable, and I would like to constrain W2 to be positive!

I have looked around a lot regarding this issue, but I have not found anything about constraints on specific weights/coefficients; all I have found is about setting all coefficients positive or summing them to one.

I am working in Python using the scikit-learn package. This is how I get my best model:

from sklearn.model_selection import GridSearchCV  # the old grid_search module was removed in scikit-learn 0.20
from sklearn.linear_model import Ridge

def ridge(Xtrain, Xtest, Ytrain, Ytest, position):
    param_grid = {'alpha': [0.01, 0.1, 1, 10, 50, 100, 1000]}
    gs = GridSearchCV(Ridge(), param_grid=param_grid, n_jobs=-1, cv=3)
    gs.fit(Xtrain, Ytrain)
    hatytrain = gs.predict(Xtrain)
    hatytest = gs.predict(Xtest)
    return gs, hatytrain, hatytest

Any idea how I could assign a constraint to the coefficient of a specific variable? It will probably be burdensome to define each constraint, but I have no idea how to do otherwise.

Scikit-learn does not allow such constraints on the coefficients.

But you can impose any constraints on the coefficients and optimize the loss with coordinate descent if you implement your own estimator. In the unconstrained case, coordinate descent produces the same result as OLS in a reasonable number of iterations.
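As a quick sanity check of that claim (this snippet is not from the original answer; the synthetic data is made up for illustration), unconstrained coordinate descent can be compared with the solution from np.linalg.lstsq:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta = np.zeros(X.shape[1])
hessian = X.T @ X
for _ in range(1000):                       # fixed sweep budget for the sketch
    for i in range(len(beta)):
        grad = X.T @ (X @ beta - y)         # gradient of 0.5 * ||X @ beta - y||^2
        beta[i] -= grad[i] / hessian[i, i]  # exact minimization along coordinate i

ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, ols))               # True: matches the OLS solution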

I've written a class that imposes upper and lower bounds on LinearRegression coefficients. You can extend it to use a Ridge or even Lasso penalty if you want:

from sklearn.linear_model._base import LinearModel  # named sklearn.linear_model.base before scikit-learn 0.22
from sklearn.base import RegressorMixin
from sklearn.utils import check_X_y
import numpy as np

class ConstrainedLinearRegression(LinearModel, RegressorMixin):

    def __init__(self, fit_intercept=True, normalize=False, copy_X=True, nonnegative=False, tol=1e-15):
        self.fit_intercept = fit_intercept
        self.normalize = normalize
        self.copy_X = copy_X
        self.nonnegative = nonnegative
        self.tol = tol

    def fit(self, X, y, min_coef=None, max_coef=None):
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], y_numeric=True, multi_output=False)
        X, y, X_offset, y_offset, X_scale = self._preprocess_data(
            X, y, fit_intercept=self.fit_intercept, normalize=self.normalize, copy=self.copy_X)
        # Default to unbounded coefficients unless explicit bounds are given
        self.min_coef_ = min_coef if min_coef is not None else np.repeat(-np.inf, X.shape[1])
        self.max_coef_ = max_coef if max_coef is not None else np.repeat(np.inf, X.shape[1])
        if self.nonnegative:
            self.min_coef_ = np.clip(self.min_coef_, 0, None)

        beta = np.zeros(X.shape[1]).astype(float)
        prev_beta = beta + 1
        hessian = np.dot(X.transpose(), X)
        # Coordinate descent: optimize one coefficient at a time,
        # clipping each update to its [min_coef_, max_coef_] box
        while not (np.abs(prev_beta - beta) < self.tol).all():
            prev_beta = beta.copy()
            for i in range(len(beta)):
                # Gradient of 0.5 * ||X @ beta - y||^2 with respect to beta
                grad = np.dot(np.dot(X, beta) - y, X)
                # Exact minimization along coordinate i, projected onto the bounds
                beta[i] = np.minimum(self.max_coef_[i],
                                     np.maximum(self.min_coef_[i],
                                                beta[i] - grad[i] / hessian[i, i]))

        self.coef_ = beta
        self._set_intercept(X_offset, y_offset, X_scale)
        return self

You can use this class, for example, to make all coefficients non-negative:

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; kept here as in the original answer
from sklearn.linear_model import LinearRegression
X, y = load_boston(return_X_y=True)
model = ConstrainedLinearRegression(nonnegative=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)

This produces output like

-36.99292986145538
[0.         0.05286515 0.         4.12512386 0.         8.04017956
 0.         0.         0.         0.         0.         0.02273805
 0.        ]

You can see that most coefficients are zero. An ordinary LinearRegression would have made them negative:

model = LinearRegression()
model.fit(X, y)
print(model.intercept_)
print(model.coef_)

which returns

36.49110328036191
[-1.07170557e-01  4.63952195e-02  2.08602395e-02  2.68856140e+00
 -1.77957587e+01  3.80475246e+00  7.51061703e-04 -1.47575880e+00
  3.05655038e-01 -1.23293463e-02 -9.53463555e-01  9.39251272e-03
 -5.25466633e-01]

You can also impose arbitrary bounds on any coefficients you choose - that's what you asked for. For example, in this setup

model = ConstrainedLinearRegression()
min_coef = np.repeat(-np.inf, X.shape[1])
min_coef[0] = 0
min_coef[4] = -1
max_coef = np.repeat(4, X.shape[1])
max_coef[3] = 2
model.fit(X, y, max_coef=max_coef, min_coef=min_coef)
print(model.intercept_)
print(model.coef_)

you would get the output

24.060175576410515
[ 0.          0.04504673 -0.0354073   2.         -1.          4.
 -0.01343263 -1.17231216  0.2183103  -0.01375266 -0.7747823   0.01122374
 -0.56678676]

Update. This solution can be adapted to work with constraints on linear combinations of the coefficients (e.g. their sum) - in this case, the individual constraints for each coefficient are recalculated at each step. This GitHub gist provides an example.
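For illustration only (the gist mentioned above is the author's actual example, which is not reproduced here), a minimal sketch of the idea for a sum constraint sum(beta) <= total_max: the upper bound for each coordinate is recomputed from the current values of the other coefficients before each update.

import numpy as np

def fit_with_sum_constraint(X, y, total_max, n_sweeps=1000):
    # Hypothetical helper, not from the original answer: least squares
    # with the linear-combination constraint sum(beta) <= total_max
    beta = np.zeros(X.shape[1])
    hessian = X.T @ X
    for _ in range(n_sweeps):
        for i in range(len(beta)):
            grad = X.T @ (X @ beta - y)
            # Recalculate this coordinate's upper bound so that the
            # sum constraint still holds after the update
            max_i = total_max - (beta.sum() - beta[i])
            beta[i] = min(max_i, beta[i] - grad[i] / hessian[i, i])
    return beta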

As of scikit-learn version 0.24.2, you can force the algorithm to use positive coefficients by passing the parameter positive=True to LinearRegression; by multiplying the columns for which you want negative coefficients by -1, you should get what you want.
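A minimal sketch of that trick (the column indices below are made up for illustration): flip the sign of the features that should get negative coefficients, fit with positive=True, then flip the corresponding coefficients back:

import numpy as np
from sklearn.linear_model import LinearRegression

neg_cols = [1, 3]                      # hypothetical: columns that should end up negative
X_flipped = X.copy()
X_flipped[:, neg_cols] *= -1           # a positive coefficient on -x is a negative one on x

model = LinearRegression(positive=True)
model.fit(X_flipped, y)

coef = model.coef_.copy()
coef[neg_cols] *= -1                   # recover coefficients for the original features
print(model.intercept_, coef)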
