I have 34 samples with 4 inputs and one output in an Excel file. I am making predictions with a gradient boosting regressor (GBR), and I want to find the optimal GBR parameters using the grid search method from sklearn, with cross-validation to split the data. I implemented the code below to tune the GBR parameters, but I get the error shown here. The code was originally written for a classification problem using XGB, and I modified it to fit my regression problem. Can you please help me fix this error? Is what I have done correct?
The error I get:
ValueError Traceback (most recent call last)
<ipython-input-5-4ee3b80c1f07> in <module>()
23 kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
24 grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,verbose=1)
---> 25 grid_result = grid_search.fit(X, label_encoded_y)
26 # summarize results
27 print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
D:\Anconda\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
637 error_score=self.error_score)
638 for parameters, (train, test) in product(candidate_params,
--> 639 cv.split(X, y, groups)))
640
641 # if one choose to see train score, "out" will contain train score info
D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
330 n_samples))
331
--> 332 for train, test in super(_BaseKFold, self).split(X, y, groups):
333 yield train, test
334
D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
93 X, y, groups = indexable(X, y, groups)
94 indices = np.arange(_num_samples(X))
---> 95 for test_index in self._iter_test_masks(X, y, groups):
96 train_index = indices[np.logical_not(test_index)]
97 test_index = indices[test_index]
D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in _iter_test_masks(self, X, y, groups)
632
633 def _iter_test_masks(self, X, y=None, groups=None):
--> 634 test_folds = self._make_test_folds(X, y)
635 for i in range(self.n_splits):
636 yield test_folds == i
D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in _make_test_folds(self, X, y)
597 raise ValueError("n_splits=%d cannot be greater than the"
598 " number of members in each class."
--> 599 % (self.n_splits))
600 if self.n_splits > min_groups:
601 warnings.warn(("The least populated class in y has only %d"
ValueError: n_splits=2 cannot be greater than the number of members in each class.
This is my attempt:
# XGB, Tune n_estimators and max_depth
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn import ensemble
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from IPython.core.interactiveshell import InteractiveShell
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy as np
#read data
Data_ini = pd.read_excel('Data - 1 output -Ra-in - Crossvalidation.xlsx').iloc[:,:] #read data
#encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = GradientBoostingRegressor()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot results
scores = np.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png')
You get this error because you are using StratifiedKFold for a regression problem. From its documentation:
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
You will get a ValueError when no class (in a regression problem, each distinct target value counts as a class) has at least n_splits members, which here means at least two. You can reproduce the error with:
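import numpy as np
from sklearn.model_selection import StratifiedKFold
x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)  # ten distinct target values, one sample each
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
# split() is a generator, so iterate over it to trigger the ValueError
list(kfold.split(x, y))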
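The error goes away if one of the classes has a second member (scikit-learn then only warns that the least populated class has fewer members than n_splits):
x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
y[1] = 5  # the value 5 now appears twice
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
list(kfold.split(x, y))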
To make your code work, replace StratifiedKFold with KFold.
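As a minimal sketch of that replacement, the same toy arrays used in the reproduction above can be split with KFold, which ignores the target values when building folds. The reshape of x to a 2D array and the `folds` variable are only there for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Continuous target: every value is unique, as in a typical regression problem.
x = np.linspace(1, 10, 10).reshape(-1, 1)
y = np.linspace(1, 10, 10)

kfold = KFold(n_splits=2, shuffle=True, random_state=0)

# split() is a generator, so materialise it to actually compute the folds.
folds = list(kfold.split(x, y))
for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

No ValueError is raised here, because KFold does not try to preserve class proportions.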
EDIT
Because neg_log_loss requires predict_proba, which GradientBoostingRegressor does not implement, it cannot be used as the scoring function. Since you are training a regression model, use neg_mean_absolute_error or another regression metric from scikit-learn's list of scoring parameters.
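Putting both changes together, the grid search might look roughly like the sketch below. The data is synthetic (random numbers shaped like the 34 samples with 4 inputs described in the question), because the contents of the Excel file are not shown, and the grid is smaller than the one in the question to keep it quick. The target is passed as-is, without LabelEncoder, since encoding a continuous target as integer class labels would replace its values with ranks.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the Excel data (an assumption, not the real data):
# 34 samples, 4 inputs, 1 continuous output.
rng = np.random.RandomState(0)
X = rng.rand(34, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.randn(34)

# Reduced grid for illustration; extend it as needed.
param_grid = {"max_depth": [2, 4], "n_estimators": [50, 100]}

# KFold instead of StratifiedKFold, and a regression metric instead of neg_log_loss.
kfold = KFold(n_splits=2, shuffle=True, random_state=0)
grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=kfold,
)
grid_result = grid_search.fit(X, y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```

The reported score is the negated mean absolute error, so values closer to zero are better.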