
R regularize coefficients in regression

I'm trying to use linear regression to figure out the best weighting for 3 models to predict an outcome. So there are 3 variables (x1, x2, x3) that are the predictions of the dependent variable, y. My question is: how do I run a regression with the constraint that the coefficients sum to 1? For example:

this is good:

y = .2(x1) + .4(x2) + .4(x3) 

since .2 + .4 + .4 = 1

this is no good:

y = 1.2(x1) + .4(x2) + .3(x3)

since 1.2 + .4 + .3 > 1

I'm looking to do this in R if possible. Thanks. Let me know if this needs to get moved to the stats area (Cross Validated).

EDIT:

The problem is to classify each row as 1 or 0. y is the actual value (0 or 1) from the training set, x1 is the predicted values from a kNN model, x2 is from a randomForest, and x3 is from a gbm model. I'm trying to get the best weighting for each model, so that each coefficient is <= 1 and the sum of the coefficients == 1. It would look something like this:

y/Actual value       knnPred      RfPred     gbmPred
      0                .1111       .0546       .03325
      1                .7778       .6245       .60985
      0                .3354       .1293       .33255
      0                .2235       .9987       .10393
      1                .9888       .6753       .88933
     ...                 ...         ...         ...

The measure of success is AUC. So I'm trying to set the coefficients to maximize AUC while making sure they sum to 1.

There's very likely a better way that someone else will share, but you're looking for two parameters such that

b1 * x1 + b2 * x2 + (1 - b1 - b2) * x3

is close to y. To do that, I'd write an error function to minimize:

minimizeMe <- function(b, x, y) {  ## Calculates MSE
    mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}

and throw it to optim:

fit <- optim(par = c(.2, .4), fn = minimizeMe, x = cbind(x1, x2, x3), y = y)
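To see this in action, here is a self-contained sketch on simulated data; the true weights (0.2/0.5/0.3), sample size, and noise level are invented purely for illustration:

```r
# Simulated example: fabricated true weights 0.2/0.5/0.3 with small noise
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 0.2 * x1 + 0.5 * x2 + 0.3 * x3 + rnorm(n, sd = 0.01)

minimizeMe <- function(b, x, y) {  ## MSE with the third weight implied
    mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}

fit <- optim(par = c(.2, .4), fn = minimizeMe, x = cbind(x1, x2, x3), y = y)
b <- c(fit$par, 1 - sum(fit$par))  # third weight recovered from the constraint
b  # should be close to c(0.2, 0.5, 0.3), and sums to 1 by construction
```

Because only two parameters are free and the third is 1 - b1 - b2, the sum-to-one constraint holds exactly no matter where optim lands.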

No data to test on, but:

mod1 <- lm(y ~ 0+x1+x2+x3, data=dat)
mod2 <- lm(y/I(sum(coef(mod1))) ~ 0+x1+x2+x3, data=dat)

And now that I think about it some more, skip mod2 and just use:

coef(mod1)/sum(coef(mod1))
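A minimal check of that rescaling idea on simulated data (the data frame dat and the true weights below are invented for illustration):

```r
# Fabricated data whose true weights 0.25/0.35/0.40 already sum to 1
set.seed(2)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- with(dat, 0.25 * x1 + 0.35 * x2 + 0.40 * x3 + rnorm(100, sd = 0.01))

mod1 <- lm(y ~ 0 + x1 + x2 + x3, data = dat)  # no-intercept fit
w <- coef(mod1) / sum(coef(mod1))             # rescale so weights sum to 1
w
```

Note that rescaling the unconstrained fit is a heuristic: it guarantees the weights sum to 1 but is not, in general, the same as solving the constrained least-squares problem directly.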

For the five rows shown, either round(knnPred) or round(gbmPred) gives perfect predictions, so there is some question whether more than one predictor is needed.

At any rate, to solve the question as stated, the following gives nonnegative coefficients that sum to 1 (except possibly for tiny differences due to floating-point arithmetic). a is the matrix of independent variables and b is the dependent variable; c and d define the equality constraint (coefficients sum to 1), and e and f define the inequality constraints (coefficients are nonnegative).

library(lsei)
a <- cbind(x1, x2, x3)      # matrix of independent variables
b <- y                      # dependent variable
c <- matrix(c(1, 1, 1), 1)  # equality constraint: coefficients sum to...
d <- 1                      # ...1
e <- diag(3)                # inequality constraints: each coefficient >=...
f <- c(0, 0, 0)             # ...0
lsei(a, b, c, d, e, f)
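As a self-contained illustration (the predictions and true weights 0.1/0.6/0.3 below are fabricated, and the lsei package must be installed):

```r
# Sketch of the lsei approach on invented, noise-free predictions
library(lsei)
set.seed(3)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 0.1 * x1 + 0.6 * x2 + 0.3 * x3

a <- cbind(x1, x2, x3)      # matrix of independent variables
b <- y                      # dependent variable
c <- matrix(c(1, 1, 1), 1)  # equality: coefficients sum to 1
d <- 1
e <- diag(3)                # inequality: each coefficient >= 0
f <- c(0, 0, 0)
w <- lsei(a, b, c, d, e, f)
w  # recovers the generating weights since y is an exact combination
```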
