
overfitting reduction in machine learning

Guys, I'm doing Machine Learning on Coursera by Andrew Ng. In one of the lectures he described how we can prevent overfitting by modifying the cost function. My question is about the formula below: we add two terms at the end to shrink the values of theta3 and theta4. Why exactly do we add those terms? I mean, we could just reduce the values of theta3 and theta4 directly, and that would reduce the value of our cost function.

$$\min_\theta \; \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$$

Usually when we want to fit a model, it is intuitive to add as many features as possible to try and find a mapping from the features to the expected outputs. However, adding too many features, especially non-linear ones, may overfit the data.

Therefore, regularization (ridge regression in this case) allows us to keep all of the parameters, but ensures that their magnitudes stay as small as possible while the overall cost for the fitted parameters stays low. Smaller parameter magnitudes enforce a simpler prediction model, which generalizes better to new inputs the model has not seen before.

As you can see, your loss function now has two parts. The first set of terms is the standard one, where we minimize the sum of squared errors between the predicted and expected values. The second set of terms are known as the regularization terms. It may look bizarre, but it does make sense. Each squared parameter is multiplied by another value, which is usually called λ; in your example it is set to 1000. The reason for doing this is to "punish" the loss function for large values of those parameters. From what I said before, simple models tend to generalize better than complex models and usually do not overfit, so we want to push the model towards simplicity.

This also answers your question of why we add the terms instead of just reducing theta3 and theta4 ourselves: we do not pick the parameter values by hand. They are found by gradient descent, an iterative process that minimizes the loss function. By adding the penalty to the cost, we make the optimizer itself drive theta3 and theta4 towards zero while it fits the data, so the constraint is enforced automatically.
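To make this concrete, here is a minimal NumPy sketch (not from the course, just my own illustration) of the cost and gradient with the hard-coded penalty of 1000 on theta3 and theta4 from the formula above. The toy data and all variable names are assumptions; the point is only to show that plain gradient descent on the penalized cost keeps those two parameters close to zero.

    import numpy as np

    def cost(theta, X, y, penalty=1000.0):
        """Squared-error cost plus a heavy penalty on theta[3] and theta[4]."""
        m = len(y)
        errors = X @ theta - y                     # h_theta(x) - y for every example
        mse_term = (errors @ errors) / (2 * m)     # 1/(2m) * sum of squared errors
        reg_term = penalty * (theta[3] ** 2 + theta[4] ** 2)
        return mse_term + reg_term

    def gradient(theta, X, y, penalty=1000.0):
        """Gradient of the cost above with respect to theta."""
        m = len(y)
        grad = X.T @ (X @ theta - y) / m
        grad[3] += 2 * penalty * theta[3]          # extra pull towards zero on theta_3
        grad[4] += 2 * penalty * theta[4]          # extra pull towards zero on theta_4
        return grad

    # Toy polynomial features 1, x, x^2, x^3, x^4 for a handful of points.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    X = np.vander(x, 5, increasing=True)
    y = 1 + 2 * x + rng.normal(scale=0.1, size=x.shape)   # underlying relation is linear

    theta = np.zeros(5)
    for _ in range(10000):                          # plain gradient descent
        theta -= 2e-4 * gradient(theta, X, y)

    print(np.round(theta, 4))   # theta[3] and theta[4] stay close to zero

The small learning rate is deliberate: with a penalty as large as 1000, a bigger step size would make the update on theta[3] and theta[4] overshoot and diverge.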

λ is thus a hyperparameter and should be subject to tuning. If the value is too small, the penalty has little effect and the model can still overfit. If the value is too large, all of the parameters are forced to be small just to keep the cost function down, and the model underfits. Finding the right value to multiply the squared parameter terms by requires experimentation and watching how the cost function behaves over time. You choose the value that strikes the right balance: the optimization should not converge too quickly, but at the same time the final cost should be as low as possible.
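As a rough sketch of what that tuning could look like (again my own example, not from the course), you can fit the model for several values of λ and compare the error on a held-out validation set. For brevity this uses the closed-form ridge solution (XᵀX + λI)⁻¹Xᵀy instead of gradient descent, and it regularizes the bias term too, which is a simplification.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 30)
    X = np.vander(x, 6, increasing=True)           # polynomial features 1, x, ..., x^5
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    # Simple split into training and validation halves.
    X_train, X_val = X[::2], X[1::2]
    y_train, y_val = y[::2], y[1::2]

    def ridge_fit(X, y, lam):
        """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y."""
        n = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    for lam in [0.0, 0.001, 0.01, 0.1, 1.0, 10.0]:
        theta = ridge_fit(X_train, y_train, lam)
        train_err = np.mean((X_train @ theta - y_train) ** 2)
        val_err = np.mean((X_val @ theta - y_val) ** 2)
        print(f"lambda={lam:<6} train={train_err:.4f} val={val_err:.4f}")

Typically the training error keeps growing as λ increases, while the validation error first drops (less overfitting) and then rises again (underfitting); you would pick the λ near the minimum of the validation error.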

As further reading, this link provides some more intuition on how regularization works. It covers both ridge regression and LASSO regression, where the penalty is the sum of the absolute values of the parameters instead of the sum of their squares.

https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/
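If you want to see the ridge-vs-LASSO difference in practice and have scikit-learn available, a quick comparison (my own sketch, with arbitrary alpha values) is to fit both estimators on the same polynomial features and look at the coefficients. LASSO typically drives some of them exactly to zero, while ridge only shrinks them.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, size=(50, 1))
    y = 3 * x[:, 0] + rng.normal(scale=0.1, size=50)   # underlying relation is linear

    # Degree-5 polynomial features: x, x^2, ..., x^5.
    X_poly = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x)

    ridge = Ridge(alpha=1.0).fit(X_poly, y)
    lasso = Lasso(alpha=0.01).fit(X_poly, y)

    print("ridge coefficients:", np.round(ridge.coef_, 3))  # all shrunk, mostly nonzero
    print("lasso coefficients:", np.round(lasso.coef_, 3))  # typically some exactly zero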

