Lasso regression model overfitting the data

Question

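I want to build a model that describes the behavior of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation.

I found that the last method selects features (including their degree) automatically, which is what I actually need, because simple linear regression performs poorly. I followed this explanation to perform polynomial regression with Lasso regularization and cross-validation.

In that example, the method is used to avoid the overfitting that occurs with simple polynomial regression. In my case, however, it does the opposite and leads to overfitting.

Can anyone help me understand what I am doing wrong in my implementation? Or is there a better way to fit a model to this data?

Code (statsmodels for the linear regression, scikit-learn for the polynomial regression):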

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Import function to automatically create polynomial features 
from sklearn.preprocessing import PolynomialFeatures

# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV

#Initial data
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
                'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949, 0.026217822, 0.032549629, 0.025744309, 0.027945824],
                'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131, 196.610972, 194.915587, 183.209067, 182.41669],
                'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046, 1777.880947, 1792.21646, 1785.653845, 1788.401923]        
                }

Test = {'Cycle': [33, 34, 35],
                'Internal_Resistance': [0.027332509, 0.027960729, 0.028969193],
                'CV_Capacity': [204.018257, 179.929472, 189.576431],
                'Full_Capacity': [1782.983718, 1793.939504, 1788.67233]        
                }

#Initial data presented in a form of a data frame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head(32))
print()
print(X)
print()

#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)

# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant 
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)  
print_model = model.summary()
print(print_model)
print()



# Fitting data with scikit learn - simple linear regression    
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)

#Metrics of the linear model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)

#Coefficients for the linear model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))

#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)




#Fitting data with a simple polynomial model  
poly = PolynomialFeatures(2,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_poly_feature_name)
print(len(X_poly_feature_name))

df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())

df_poly['y']=df['Full_Capacity']
print(df_poly.head())

X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']

poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)

#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_poly, y_train)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_poly, y_train)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))

coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)





#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))

#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_pred1, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_pred1, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)

coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)

print ("R2 value of the new polynomial model:",model1.score(X_train,y_pred1))
print ("Alpha of the new polynomial model:",model1.alpha_)

print(coeff1[coeff1['Coefficients Metamodel']!=0])

plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
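One thing worth checking in the code above: the metrics for the LassoCV model compare `y_pred1` with itself (`mean_absolute_error(y_pred1, y_pred1)`, `mean_squared_error(y_pred1, y_pred1)`, and `model1.score(X_train, y_pred1)`). Any model scored this way reports zero error and an R2 of 1, which can look like overfitting even when the fit is modest. The sketch below illustrates the effect with hand-written helpers `mae` and `r2` that follow the usual definitions; the prediction values are hypothetical and are not output from the question's model.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# First four Full_Capacity values from the question's data.
y_train = [1703.8575, 1740.7017, 1760.66, 1775.248302]
# Hypothetical predictions, for illustration only.
y_pred = [1720.0, 1745.0, 1755.0, 1770.0]

# Comparing predictions with themselves: always a "perfect" score.
mae_self = mae(y_pred, y_pred)
r2_self = r2(y_pred, y_pred)

# Comparing predictions with the observed targets: the meaningful score.
mae_real = mae(y_train, y_pred)
r2_real = r2(y_train, y_pred)

print("self-comparison:", mae_self, r2_self)
print("against targets:", mae_real, r2_real)
```

In the question's code, the corresponding change would be to pass `y_train` as the reference argument in those three calls.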

Answer 1

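I found that a simple polynomial with a single interaction term gives a reasonable fit. Note that a 3D scatter plot of the SoH data, leaving out "Cycle", shows regions where additional data would help characterize the response surface: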

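# Fitted coefficients
a = 1.6708148450040499E+03
b = 6.5825133247934986E-01
c = 4.8477389499541523E+03
d = -2.7015882838321772E+01

# The function name is illustrative; the answer gave only the body.
def full_capacity(CV_Capacity, Internal_Resistance):
    temp = a
    temp += b * CV_Capacity
    temp += c * Internal_Resistance
    temp += d * Internal_Resistance * CV_Capacity
    return temp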
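As a rough check, the fitted surface can be evaluated on the three held-out cycles (33 to 35) from the question's `Test` dictionary. This is a minimal sketch: the function name `full_capacity` and the variable names are illustrative, and the coefficients are repeated so that the snippet runs on its own.

```python
# Coefficients as given in the answer.
a = 1.6708148450040499E+03
b = 6.5825133247934986E-01
c = 4.8477389499541523E+03
d = -2.7015882838321772E+01


def full_capacity(CV_Capacity, Internal_Resistance):
    """Polynomial with one interaction term (illustrative wrapper)."""
    return (a
            + b * CV_Capacity
            + c * Internal_Resistance
            + d * Internal_Resistance * CV_Capacity)


# Held-out points copied from the question's Test dictionary.
test_internal_resistance = [0.027332509, 0.027960729, 0.028969193]
test_cv_capacity = [204.018257, 179.929472, 189.576431]
test_full_capacity = [1782.983718, 1793.939504, 1788.67233]

predictions = [
    full_capacity(cv, ir)
    for cv, ir in zip(test_cv_capacity, test_internal_resistance)
]
errors = [abs(p - y) for p, y in zip(predictions, test_full_capacity)]

for cycle, p, y, e in zip((33, 34, 35), predictions, test_full_capacity, errors):
    print(f"Cycle {cycle}: predicted {p:.2f}, actual {y:.2f}, abs error {e:.2f}")
```

Three points are too few to judge generalization, so this only shows whether the surface stays in a plausible range outside the training cycles.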

Answer 2

Lasso is a regularization method that can be used to avoid overfitting.

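This approach adds a term to the loss function that acts as a constraint on the weights. The loss function then has two terms: one responsible for fitting the data and one for regularization.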
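For reference, the two-term loss can be written out as follows. This is the form given in the scikit-learn documentation for `Lasso`; the symbols are assumed notation rather than anything defined in the question: $X$ is the feature matrix, $y$ the targets, $w$ the weights, $n$ the number of samples, and $\alpha$ the regularization constant.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \; + \; \alpha \lVert w \rVert_1
```

The first term measures the fit to the data and the second penalizes the absolute size of the weights, which is what drives some of them to exactly zero.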

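There is also a constant that controls the trade-off between these two terms. In your case, you should probably increase the strength of the regularization term (increase the constant) to avoid overfitting.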
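To illustrate what increasing the constant does, here is a minimal sketch for the simplest possible case: a single feature scaled so that the mean of its squared values is 1, with the loss written as in scikit-learn (where the constant is called `alpha`). Under those assumptions the Lasso coefficient has a closed form, the soft-thresholding of the unpenalized coefficient. The value of `rho` is hypothetical.

```python
def soft_threshold(rho, alpha):
    """Closed-form Lasso coefficient for one feature with mean(x**2) == 1.

    rho is the unpenalized least-squares coefficient, mean(x * y).
    """
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


rho = 0.8  # hypothetical unpenalized coefficient
for alpha in (0.0, 0.2, 0.5, 1.0):
    # A larger alpha shrinks the coefficient and eventually zeroes it.
    print(f"alpha={alpha}: coefficient={soft_threshold(rho, alpha):.3f}")
```

With several correlated polynomial features there is no such closed form, but the qualitative effect is the same. In scikit-learn, `LassoCV` chooses `alpha` from a grid by cross-validation, and that grid can be supplied explicitly through its `alphas` parameter if stronger regularization should be explored.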

2 answers

Answer 1
0 Accepted 2020-01-25 18:41:08

Answer 2
0 2020-01-25 18:50:42
