I'm surprised that I get a negative score on my predictions using RandomForestRegressor with the default scorer (the coefficient of determination, R²). Any help would be appreciated. My dataset looks something like this: dataset screenshot here
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, train_test_split
import numpy as np, pandas as pd, pickle

dataframe = pd.read_csv("../../notebook/car-sales.csv")
y = dataframe["Price"].str.replace(r"[\$\.,]", "", regex=True).astype(int)
x = dataframe.drop("Price", axis=1)

cat_features = [
    "Make",
    "Colour",
    "Doors",
]

oneencoder = OneHotEncoder()
transformer = ColumnTransformer([
    ("onehot", oneencoder, cat_features)
], remainder="passthrough")
transformered_x = transformer.fit_transform(x)
transformered_x = pd.get_dummies(dataframe[cat_features])  # note: this overwrites the ColumnTransformer output

x_train, x_test, y_train, y_test = train_test_split(transformered_x, y, test_size=.2)
regressor = RandomForestRegressor(n_estimators=100)
regressor.fit(x_train, y_train)
regressor.score(x_test, y_test)
I modified your code just a little and was able to achieve a score of 0.89, so you were very close. Nicely done on your part!
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd

dataframe = pd.read_csv("car-sales.csv")
dataframe.head()

y = dataframe["Price"].str.replace(r"[\$\.,]", "", regex=True).astype(int)
x = dataframe.drop("Price", axis=1)

cat_features = ["Make", "Colour", "Odometer", "Doors"]
# one-hot encode the selected columns (pd.get_dummies replaces the earlier ColumnTransformer step)
transformered_x = pd.get_dummies(dataframe[cat_features])

x_train, x_test, y_train, y_test = train_test_split(transformered_x, y, test_size=.2, random_state=3)

forest = RandomForestRegressor(n_estimators=200, criterion="squared_error",
                               min_samples_leaf=3, min_samples_split=3, max_depth=10)
forest.fit(x_train, y_train)

# R^2 score: 1 is perfect prediction
print('Score: %.2f' % forest.score(x_test, y_test))
I think the score was negative due to extreme over-fitting caused by an extremely small amount of data.
This is directly from the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of
squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares
((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it
can be negative (because the model can be arbitrarily worse). A constant model
that always predicts the expected value of y, disregarding the input features,
would get a R^2 score of 0.0.
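To make the quoted definition concrete, here is a minimal sketch (with made-up numbers, not your car-sales data) showing that a constant model predicting the mean scores exactly 0.0, while a model worse than that goes negative:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100, 200, 300, 400])

# A constant model that always predicts the mean: u == v, so R^2 = 1 - u/v = 0.0
constant_pred = np.full(y_true.shape, y_true.mean())

# A model whose predictions are worse than the mean: u > v, so R^2 < 0
bad_pred = np.array([400.0, 300.0, 200.0, 100.0])

print(r2_score(y_true, constant_pred))  # 0.0
print(r2_score(y_true, bad_pred))       # -3.0
```

So a negative score simply means the fitted forest predicts the test set worse than always guessing the mean price, which is plausible with a handful of rows.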
I enlarged the dataset to 100 rows, dropped the surrogate key (the first column, an integer id 0–99), and here it is:
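As a side note, the encoding and the model can be combined into a single Pipeline so the encoder is fit inside each cross-validation fold. A minimal sketch on synthetic data (the column names and value ranges here are invented for illustration, not taken from the original CSV):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "Make": rng.choice(["Toyota", "Honda", "BMW"], n),
    "Colour": rng.choice(["Red", "Blue", "White"], n),
    "Doors": rng.choice([3, 4, 5], n),
    "Odometer": rng.integers(10_000, 200_000, n),
})
# Synthetic price tied to the odometer so there is real signal to learn
y = 30_000 - 0.1 * df["Odometer"] + rng.normal(0, 1_000, n)

cat_features = ["Make", "Colour", "Doors"]
pipeline = Pipeline([
    ("prep", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_features)],
        remainder="passthrough")),  # Odometer passes through as a numeric feature
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
scores = cross_val_score(pipeline, df, y, cv=5)  # default scorer for a regressor is R^2
print(scores.mean())
```

Keeping the odometer numeric (via `remainder="passthrough"`) rather than one-hot encoding it avoids creating one column per unique reading, which is another easy route to over-fitting on small data.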