
RandomForestRegressor in sklearn giving negative scores

I'm surprised that I get a negative score on my predictions using RandomForestRegressor. I'm using the default scorer (the coefficient of determination, R^2). Any help would be appreciated. My dataset looks something like this: (dataset screenshot here)

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import pandas as pd

dataframe = pd.read_csv("../../notebook/car-sales.csv")

# Strip "$", "." and "," from the price strings, then cast to int
y = dataframe["Price"].str.replace(r"[\$\.,]", "", regex=True).astype(int)
x = dataframe.drop("Price", axis=1)

cat_features = [
    "Make",
    "Colour",
    "Doors",
]

oneencoder = OneHotEncoder()
transformer = ColumnTransformer(
    [("onehot", oneencoder, cat_features)],
    remainder="passthrough",
)
transformered_x = transformer.fit_transform(x)
# Note: this line overwrites the ColumnTransformer output above
transformered_x = pd.get_dummies(dataframe[cat_features])

x_train, x_test, y_train, y_test = train_test_split(transformered_x, y, test_size=.2)

regressor = RandomForestRegressor(n_estimators=100)
regressor.fit(x_train, y_train)
regressor.score(x_test, y_test)

I modified your code just a little bit and was able to achieve a score of 0.89. You were very close; nicely done.

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import pandas as pd

dataframe = pd.read_csv("car-sales.csv")
dataframe.head()

# Strip "$", "." and "," from the price strings, then cast to int
y = dataframe["Price"].str.replace(r"[\$\.,]", "", regex=True).astype(int)
x = dataframe.drop("Price", axis=1)

cat_features = ["Make", "Colour", "Odometer", "Doors"]
oneencoder = OneHotEncoder()
transformer = ColumnTransformer([("onehot", oneencoder, cat_features)], remainder="passthrough")
transformered_x = transformer.fit_transform(x)
# pd.get_dummies only encodes the object columns; numeric columns pass through unchanged
transformered_x = pd.get_dummies(dataframe[cat_features])

x_train, x_test, y_train, y_test = train_test_split(transformered_x, y, test_size=.2, random_state=3)

# criterion "mse" was renamed to "squared_error" in scikit-learn 1.0
forest = RandomForestRegressor(n_estimators=200, criterion="squared_error",
                               min_samples_leaf=3, min_samples_split=3, max_depth=10)

forest.fit(x_train, y_train)

# R^2 score: 1.0 is perfect prediction
print('Score: %.2f' % forest.score(x_test, y_test))

I think the score was negative because of extreme overfitting on an extremely small dataset.
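
One quick way to check for that kind of overfitting is to compare the train and test scores: a model that scores near 1.0 on the training set but poorly on the test set has memorized the data. A minimal sketch, reusing the forest, transformered_x, and the train/test splits from the code above:

from sklearn.model_selection import cross_val_score

# A large gap between these two numbers is a classic overfitting signal
print("Train R^2:", forest.score(x_train, y_train))
print("Test  R^2:", forest.score(x_test, y_test))

# On a tiny dataset, cross-validation gives a less noisy estimate
# than a single train/test split
scores = cross_val_score(forest, transformered_x, y, cv=5)
print("CV R^2 scores:", scores, "mean:", scores.mean())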

This is directly from the scikit-learn documentation for RandomForestRegressor.score (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html):

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
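
To make that concrete, here is a small worked example using sklearn's r2_score, which computes the same quantity as the default scorer: any prediction that is farther from the truth than simply predicting the mean of y_true yields a negative R^2.

from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40]  # mean is 25

# The constant mean prediction scores exactly 0.0:
# u = v = 500, so R^2 = 1 - 500/500 = 0.0
print(r2_score(y_true, [25, 25, 25, 25]))

# Predictions worse than the mean score below zero:
# u = 2000, v = 500, so R^2 = 1 - 2000/500 = -3.0
print(r2_score(y_true, [40, 30, 20, 10]))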

I enlarged the dataset to 100 rows, dropped the surrogate key (the first column, an integer id 0-99), and here is the result:

(result screenshot)
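
For reference, a minimal sketch of that cleanup step; the column name "Unnamed: 0" is an assumption about how pandas labels an unnamed id column in the CSV, so adjust it to match your file:

import pandas as pd

dataframe = pd.read_csv("car-sales.csv")

# Drop the surrogate key so the model can't memorize row ids
# (assumes the id column was read in as "Unnamed: 0")
dataframe = dataframe.drop(columns=["Unnamed: 0"])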
