简体   繁体   中英

In Linear Regression Modeling why my RMSE Value is so large?

This is my dataset and Median_Price is my target variable RMSE VALUE before and after using GridSearch CV parameter tuning is attached in the code. How Can I decrease the RMSE based on my dataset??

Dataset is to download from google drive here and also I had added a picture of the dataset for understanding.

在此处输入图像描述

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline

dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')

dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)

dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)

dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)

X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]

y = dataset['Median_Price']

from sklearn.model_selection import cross_val_score# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')

    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# get cross val scores
get_cv_scores(regressor)

from sklearn.linear_model import Ridge# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train, y_train)# get cross val scores
get_cv_scores(ridge)

# find optimal alpha with grid search
alpha = \[9,10,11,12,13,14,15,100,1000\]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)
### Before GridSerach RMSE: 487656.3828
### After GridSerach RMSE: 453873.438


coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))][1]

Dataset CSV download link

Well, there seems to be a certain decrease in the RMSE value after using GridSearchCV.

You can try out the feature selection, feature engineering, scale your data, transformations, try some other algorithms, these might help you decrease your RMSE value to some extent.

Also, the RMSE value depends completely on the context of data. Seems your data points are separated far from each other which is giving you very high RMSE value. The different techniques I mentioned above can help you to decrease RMSE only to a limited extent.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM