
Finding Root Mean Squared Error with Pandas dataframe

I am trying to calculate the root mean squared error from a pandas DataFrame. I have checked out previous links on Stack Overflow, such as Root mean square error in python, and the scikit-learn documentation http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html, and I was hoping someone out there could shed some light on what I am doing wrong. Here is the dataset. Here is my code.

import pandas as pd
import numpy as np
sales = pd.read_csv("home_data.csv")

from sklearn.cross_validation import train_test_split
train_data,test_data = train_test_split(sales,train_size=0.8)

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y = train_data.price
# build the linear regression object
lm = LinearRegression()
# train the model using the training sets
lm.fit(X, y)
# print the y intercept
print(lm.intercept_)
# print the coefficients
print(lm.coef_)

lm.predict(300)



from math import sqrt
from sklearn.metrics import mean_squared_error
y_true = train_data.price.loc[0:5,]
test_data = test_data[['price']].reset_index()
y_pred = test_data.price.loc[0:5,]
predicted = y_pred.as_matrix()
actual = y_true.as_matrix()
mean_squared_error(actual, predicted)

EDIT

So this is what worked for me. I had to reshape the test dataset's sqft_living values from a row (a 1-D array) into a column (a 2-D array).

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y = train_data.price
# build the linear regression object
lm = LinearRegression()
# train the model using the training sets
lm.fit(X, y)

New code

test_X = test_data.sqft_living.values
print(test_X)
print(np.shape(test_X))
print(len(test_X))
test_X = np.reshape(test_X, [4323, 1])
print(test_X)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
MSE = mean_squared_error(y_true = test_data.price.values, y_pred = lm.predict(test_X))
MSE
MSE**(0.5)
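
For what it's worth, here is a minimal sketch of that same step that avoids hard-coding the number of test rows; it assumes the train_data/test_data split and the fitted lm from the snippets above, and uses reshape(-1, 1) so NumPy infers the row count:

import numpy as np
from sklearn.metrics import mean_squared_error

# reshape(-1, 1) lets NumPy infer the number of rows instead of hard-coding 4323
test_X = test_data.sqft_living.values.reshape(-1, 1)
MSE = mean_squared_error(y_true=test_data.price.values, y_pred=lm.predict(test_X))
RMSE = MSE ** 0.5
print(RMSE)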

You're comparing test-set labels to training-set labels. I believe that what you actually want to do is compare test-set labels to predicted test-set labels.

For example:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split

sales = pd.read_csv("home_data.csv")
train_data, test_data = train_test_split(sales,train_size=0.8)

# Train the model
X = train_data[['sqft_living']]
y = train_data.price
lm = LinearRegression()
lm.fit(X, y)

# Predict on the test data
X_test = test_data[['sqft_living']]
y_test = test_data.price
y_pred = lm.predict(X_test)

# Compute the root-mean-square
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)
# 260435.511036

Note that scikit-learn can in general handle pandas DataFrame and Series inputs without explicit conversion to NumPy arrays. The error in the code snippet in your question comes from the fact that the two arrays passed to mean_squared_error() have different sizes: the .loc[0:5,] slices are taken from the shuffled train_data on one side and the reset-index test_data on the other, so they generally do not return the same number of rows.
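
For example, a quick sketch with two toy lists (not the housing data) showing that failure mode:

from sklearn.metrics import mean_squared_error

actual = [3.0, 2.5, 4.0]           # 3 samples
predicted = [2.8, 2.9, 3.7, 4.1]   # 4 samples

# Raises a ValueError about inconsistent numbers of samples,
# the same failure mode as the mismatched .loc[0:5,] slices above.
mean_squared_error(actual, predicted)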
