
RandomForestRegressor in sklearn giving negative scores

I'm surprised that I get a negative score on my predictions using RandomForestRegressor with the default scorer (the coefficient of determination, R^2). Any help will be appreciated. My dataset looks something like this: [dataset screenshot]

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import pandas as pd

dataframe = pd.read_csv("../../notebook/car-sales.csv")

# Strip "$", "." and "," from the price strings before casting to int
# (regex=True is needed explicitly in pandas >= 2.0)
y = dataframe["Price"].str.replace(r"[\$\.\,]", "", regex=True).astype(int)
x = dataframe.drop("Price", axis=1)

cat_features = [
    "Make",
    "Colour",
    "Doors",
]
oneencoder = OneHotEncoder()
transformer = ColumnTransformer([
    ("onehot", oneencoder, cat_features)
], remainder="passthrough")
transformered_x = transformer.fit_transform(x)
transformered_x = pd.get_dummies(dataframe[cat_features])  # note: this overwrites the ColumnTransformer output above

x_train, x_test, y_train, y_test = train_test_split(transformered_x, y, test_size=.2)
regressor = RandomForestRegressor(n_estimators=100)
regressor.fit(x_train, y_train)
regressor.score(x_test, y_test)

I modified your code just a little bit and was able to achieve a score of 89%. You were SO close. Nicely done on your part. Not shabby!

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import pandas as pd

dataframe = pd.read_csv("car-sales.csv")
dataframe.head()

# Strip "$", "." and "," from the price strings before casting to int
y = dataframe["Price"].str.replace(r"[\$\.\,]", "", regex=True).astype(int)
x = dataframe.drop("Price", axis=1)

cat_features = ["Make", "Colour", "Odometer", "Doors"]
oneencoder = OneHotEncoder()
transformer = ColumnTransformer([("onehot", oneencoder, cat_features)], remainder="passthrough")
transformered_x = transformer.fit_transform(x)
transformered_x = pd.get_dummies(dataframe[cat_features])

x_train, x_test, y_train, y_test = train_test_split(transformered_x, y, test_size=.2, random_state=3)

# criterion "mse" was renamed to "squared_error" in scikit-learn 1.0
forest = RandomForestRegressor(n_estimators=200, criterion="squared_error",
                               min_samples_leaf=3, min_samples_split=3, max_depth=10)

forest.fit(x_train, y_train)

# The default score is R^2: 1 is perfect prediction
print('Score: %.2f' % forest.score(x_test, y_test))

I think the score was negative due to extreme over-fitting caused by an extremely small amount of data.
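To illustrate that point on synthetic data (not the car-sales dataset; the function name and data-generating process here are made up for demonstration): with only a handful of noisy rows, a random forest can memorise the training set and still score at or below zero on held-out data, while a larger sample scores much more reliably.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def held_out_score(n_rows, seed=0):
    """Fit a forest on n_rows of noisy synthetic data, return the test R^2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 10, size=(n_rows, 3))
    y = X[:, 0] * 2 + rng.normal(0, 5, size=n_rows)  # strong noise term
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    return forest.score(X_test, y_test)

# With ~10 rows the held-out score is unstable and often negative;
# with ~1000 rows it is consistently positive.
print(held_out_score(10))
print(held_out_score(1000))
```

The exact numbers depend on the random seed, but the gap between the tiny and the larger sample is the point: a negative score on two or three test rows says little about the model and a lot about the sample size.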

This is directly from the sklearn documentation, and I quote:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of 
squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares 
((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it 
can be negative (because the model can be arbitrarily worse). A constant model 
that always predicts the expected value of y, disregarding the input features, 
would get a R^2 score of 0.0.
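To make the quoted formula concrete, here is a minimal sketch (with made-up numbers) showing that predictions worse than simply guessing the mean produce a negative R^2:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])  # worse than always predicting the mean (2.0)

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares = 5.0
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares = 2.0

print(1 - u / v)                 # -1.5, by the formula quoted above
print(r2_score(y_true, y_pred))  # -1.5, the same value from sklearn
```

`regressor.score(x_test, y_test)` computes exactly this quantity, so a negative result just means the forest's test-set predictions were worse than a constant baseline.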

I enlarged the dataset to 100 rows, dropped the surrogate key (the first column, an int id 0-99), and here it is:

[screenshot of the enlarged dataset]
