How to properly use sklearn to predict the error of a fit
I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for it. This can easily be done with:
predictions = model.predict(X_new)
My question is: what is the error associated with this prediction? From my understanding, I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(y, model.predict(X))
And basically my real predictions for the new data should be random numbers drawn from a Gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this, and do you know if there's a faster way to do this in sklearn?
You probably want to validate your model on your training data set. I would suggest exploring the cross-validation tools in sklearn.model_selection (the old sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20).
The most basic usage is:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
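A minimal sketch of the full workflow, estimating the prediction error on the held-out test set rather than on the training data (the data here is synthetic and only stands in for the question's X and y):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# synthetic data standing in for the question's X and y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, 0.0, -1.0, 3.0]) + rng.normal(scale=0.5, size=200)

# hold out 25% of the data that the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# MSE on the held-out set is a less biased estimate of prediction error
# than MSE on the training data
test_mse = mean_squared_error(y_test, model.predict(X_test))
```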
It depends on your training data. If its distribution is a good representation of the "real world" and it is of a sufficient size (see learning theories such as PAC), then I would generally agree.
That said, if you are looking for a practical way to evaluate your model, why not use the test set as Kris has suggested? I usually use grid search for optimizing parameters:
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data[indices], y_data[indices], test_size=0.25)

# cross-validated grid search over the regularization strength
# (clf here is a Pipeline with a step named 'logistic')
params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

# print scores and best estimator
print('best param: ', grid_search.best_params_)
print('best train score: ', grid_search.best_score_)
print('Test score: ', grid_search.best_estimator_.score(X_test, y_test))
The idea is to hide the test set from your learning algorithm (and from yourself): don't train and don't optimize parameters on this data. Finally, you should use the test set for performance evaluation (error) only; it should provide an unbiased MSE.
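As a concrete, self-contained version of the workflow above: the snippet in the answer leaves clf undefined, so here it is assumed to be a Pipeline with a step named 'logistic' (the step name must match the logistic__C keys in the parameter grid), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic classification data standing in for X_data / y_data
X_data, y_data = make_classification(n_samples=300, n_features=5,
                                     random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.25, random_state=0)

# the step name 'logistic' must match the 'logistic__C' grid keys
clf = Pipeline([('scale', StandardScaler()),
                ('logistic', LogisticRegression())])

params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

# the held-out test set gives an unbiased performance estimate
print('best param:', grid_search.best_params_)
print('test score:', grid_search.best_estimator_.score(X_test, y_test))
```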