
In linear regression modeling, why is my RMSE value so large?

This is my dataset, and Median_Price is my target variable. The RMSE values before and after parameter tuning with GridSearchCV are included in the code. How can I decrease the RMSE on my dataset?

The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.

[Image: picture of the dataset]

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline

dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')

dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)

dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)

dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)

X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]

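y = dataset['Median_Price']

# The train/test split is missing from the posted code; test_size and random_state below are assumed values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)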

from sklearn.model_selection import cross_val_score

# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')

    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# get cross val scores
get_cv_scores(regressor)

from sklearn.linear_model import Ridge

# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train, y_train)

# get cross val scores
get_cv_scores(ridge)

# find optimal alpha with grid search
alpha = [9, 10, 11, 12, 13, 14, 15, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)
### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438


coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

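# y_pred is not defined in the posted code; predicting on the test set is assumed here.
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))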

Dataset CSV download link

Well, the RMSE does decrease somewhat after using GridSearchCV: from 487656.38 to 453873.44, or roughly 7%.

You can try feature selection, feature engineering, scaling your data, transforming the variables, or some other algorithms. These might help you decrease your RMSE value to some extent.
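As a minimal sketch of two of these ideas (scaling the features and transforming the target), the example below fits Ridge inside a pipeline and tunes alpha with RMSE as the score. The data is synthetic and only stands in for a skewed, price-like target; the log transform, the alpha grid and the split settings are assumptions chosen for illustration, not values taken from the question.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): five features and a skewed, price-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.exp(12.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, fit Ridge on log1p(y), and map predictions back with expm1,
# so every score below is in the original price units.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge()),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Tune alpha using RMSE directly. The score is negated because
# scikit-learn always maximizes the scoring function.
grid = GridSearchCV(
    model,
    param_grid={"regressor__ridge__alpha": [0.1, 1, 10, 100]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Best params:", grid.best_params_)
print("CV RMSE:", -grid.best_score_)
print("Test RMSE:", rmse)
```

Whether a log transform helps depends on how skewed the real Median_Price column is, so it is worth comparing the cross-validated RMSE with and without it.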

Also, the RMSE value depends entirely on the context of the data: it is expressed in the same units as the target, so it has to be judged against the scale of Median_Price. It seems your data points are spread far apart from each other, which gives you a very high RMSE value. The techniques I mentioned above can help you decrease the RMSE only to a limited extent.
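One way to judge whether an RMSE is "large" is to compare it with the scale of the target and with a trivial baseline that always predicts the mean. The sketch below does this with made-up prices and predictions (both hypothetical, not taken from the dataset in the question).

```python
import math
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


# Hypothetical property prices and model predictions, for illustration only.
prices = [250_000, 480_000, 1_200_000, 3_500_000, 760_000]
predicted = [300_000, 450_000, 1_000_000, 3_000_000, 900_000]

model_rmse = rmse(prices, predicted)

# Baseline: always predict the mean price.
baseline_rmse = rmse(prices, [mean(prices)] * len(prices))

# RMSE as a fraction of the mean price gives a scale-free view of the error.
relative_rmse = model_rmse / mean(prices)

print(f"Model RMSE:    {model_rmse:,.0f}")
print(f"Baseline RMSE: {baseline_rmse:,.0f}")
print(f"RMSE / mean price: {relative_rmse:.2f}")
```

An RMSE in the hundreds of thousands can be reasonable when prices run into the millions. What matters is how it compares with the baseline and with the typical price.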
