
How to fit a linear regression model using three independent variables and calculate the mean squared error using sklearn?

I'm trying to fit a linear regression model using three independent variables and calculate the mean squared error using sklearn, but I can't seem to get it right.

My data is the Boston Housing dataset, and the three independent variables are as follows:

1. CRIM (per capita crime rate by town)
2. RM (average number of rooms per dwelling)
3. PTRATIO (pupil-teacher ratio by town)

Fit model:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
lm = LinearRegression()
lm.fit(X[['CRIM']['RM'], ['PTRATIO']], boston_df.PRICE)

Calculate the mean squared error:

from sklearn.metrics import mean_squared_error
y_true = ['CRIM', 'RM', 'PTRATIO']
y_pred = ['PRICE']
mean_squared_error(y_true, y_pred)

Any advice or hints are much appreciated!

Try X[['CRIM', 'RM', 'PTRATIO']] instead of X[['CRIM']['RM'], ['PTRATIO']] for fitting the model. Selecting several columns takes a single list of column names inside the indexing brackets.

To compute the error, you need to compare these two vectors:

y_true = boston_df.PRICE
y_pred = lm.predict(X[['CRIM', 'RM', 'PTRATIO']])
mean_squared_error(y_true, y_pred)

Basically, your y_pred should be the values predicted by your model, which is lm in this case.
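
A minimal end-to-end sketch of the fix described above. Since `load_boston` is no longer shipped with recent scikit-learn releases, this builds a small synthetic DataFrame that only borrows the column names CRIM, RM, PTRATIO and PRICE; the values and the coefficients used to generate them are invented for illustration and are not the real Boston data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for boston_df (assumption: invented values, not real data)
rng = np.random.default_rng(0)
n = 200
boston_df = pd.DataFrame({
    "CRIM": rng.uniform(0.0, 10.0, n),
    "RM": rng.uniform(4.0, 9.0, n),
    "PTRATIO": rng.uniform(12.0, 22.0, n),
})
boston_df["PRICE"] = (
    -0.2 * boston_df["CRIM"]
    + 7.0 * boston_df["RM"]
    - 1.0 * boston_df["PTRATIO"]
    + rng.normal(0.0, 2.0, n)
)

features = ["CRIM", "RM", "PTRATIO"]
X = boston_df[features]        # a list of names selects a 3-column DataFrame
y_true = boston_df["PRICE"]    # observed target values

lm = LinearRegression()
lm.fit(X, y_true)

y_pred = lm.predict(X)         # predictions from the fitted model
mse = mean_squared_error(y_true, y_pred)
print("coefficients:", dict(zip(features, lm.coef_)))
print("in-sample MSE: %.2f" % mse)
```

Note that this measures the error on the same rows the model was fitted on; scoring on held-out rows usually gives a more honest estimate.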

sklearn has great documentation. Here is a thorough example, complete with an example data set: http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

The biggest problem you are having is your data. Look at your code here:

y_true = ['CRIM', 'RM', 'PTRATIO']
y_pred = ['PRICE']

That isn't even real data; it is just two lists of string labels, so of course this won't work:

mean_squared_error(y_true, y_pred)
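
To make the point concrete, here is a small sketch contrasting the two cases: column names passed as if they were data, and two numeric sequences of equal length. The numbers are made up, and the exact exception type and message may vary between scikit-learn versions, so the example catches both `ValueError` and `TypeError`.

```python
from sklearn.metrics import mean_squared_error

# Column names are labels, not data: this call is expected to be rejected.
try:
    mean_squared_error(["CRIM", "RM", "PTRATIO"], ["PRICE"])
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print("rejected:", type(exc).__name__)

# Two numeric sequences of equal length work as intended.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
print(mse)  # ((3 - 2)**2 + (5 - 7)**2) / 2 = 2.5
```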

From the example I posted, you could try this "hello world" type code (using an existing data set) just to make sure you get the code working; then all you need to do is replace the data set with your own data. As you can see, most of the code is dedicated to preparing the data so that it loads correctly into the linear regression function:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
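
Since the question is about three independent variables, the example above can be adapted along these lines. This is only a sketch: the choice of columns 2, 3 and 8 (bmi, bp and s5 in the diabetes data) is an arbitrary assumption for illustration, and `mean_squared_error` from `sklearn.metrics` replaces the hand-written mean of squared differences.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()

# Three feature columns by index (assumed choice: 2, 3, 8 = bmi, bp, s5)
X = diabetes.data[:, [2, 3, 8]]
y = diabetes.target

# Same split as the example above: last 20 rows held out for testing
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

# Fit on the training rows only
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# Compare held-out targets with the model's predictions for those rows
y_pred = regr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: %.2f" % mse)
```

To use your own data, replace `X` with a 2-D array or DataFrame holding your three columns and `y` with the target column.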

