I am trying to implement a random forest regressor using scikit-learn pipelines and nested cross-validation. The dataset is about housing prices, with several features (some numeric, others categorical) and a continuous target variable (median_house_value).
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
I decided to manually create two sets of five stratified splits (one for the inner loop, one for the outer loop of the nested CV). The stratification is based on a binned version of the median_income feature:
df.insert(9, "income_cat",
          pd.cut(df["median_income"],
                 bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5]))
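As a quick sanity check, here is a minimal, self-contained sketch (with made-up income values, one per bin) of how that pd.cut call maps incomes to the five categories:

```python
import numpy as np
import pandas as pd

# Hypothetical median_income values, one falling into each bin
incomes = pd.Series([0.5, 2.0, 3.2, 5.0, 8.7])
income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])
print(list(income_cat))  # [1, 2, 3, 4, 5]
```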
This is the code that builds the split indices:
cv1_5 = StratifiedShuffleSplit(n_splits=5, test_size=.2, random_state=42)
cv1_splits = []
# create first 5 stratified folds indices
for train_index, test_index in cv1_5.split(df, df["income_cat"]):
    cv1_splits.append((train_index, test_index))

cv2_5 = StratifiedShuffleSplit(n_splits=5, test_size=.2, random_state=43)
cv2_splits = []
# create second 5 stratified folds indices
for train_index, test_index in cv2_5.split(df, df["income_cat"]):
    cv2_splits.append((train_index, test_index))
# set initial dataset
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"].copy()
This is the preprocessing pipeline:
# create preprocess pipe
preprocess_pipe = Pipeline([
    ("ctransformer", ColumnTransformer([
        (
            "num_pipe",
            Pipeline([
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler())
            ]),
            list(X.select_dtypes(include=[np.number]))
        ),
        (
            "cat_pipe",
            Pipeline([
                ("encoder", OneHotEncoder()),
            ]),
            ["ocean_proximity"]
        )
    ])),
])
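To see what such a ColumnTransformer produces, here is a small sketch on a made-up two-column frame (one numeric column with a missing value, one categorical column):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, np.nan, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

ct = ColumnTransformer([
    ("num_pipe",
     Pipeline([("imputer", SimpleImputer(strategy="median")),
               ("scaler", StandardScaler())]),
     ["median_income"]),
    ("cat_pipe",
     Pipeline([("encoder", OneHotEncoder())]),
     ["ocean_proximity"]),
])

out = ct.fit_transform(toy)
# One scaled numeric column plus two one-hot columns
print(out.shape)  # (4, 3)
```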
And this is the final pipe (including the preprocess one)
pipe = Pipeline([
    ("preprocess", preprocess_pipe),
    ("model", RandomForestRegressor())
])
I am using nested cross-validation to tune the hyperparameters of the final pipe and to estimate the generalization error.
Here is the parameter grid:
param_grid = [
    {
        "preprocess__ctransformer__num_pipe__imputer__strategy": ["mean", "median"],
        "model__n_estimators": [3, 10, 30, 50, 100, 150, 300],
        "model__max_features": [2, 4, 6, 8],
    }
]
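The long key works because each double underscore steps one level deeper into the nested estimators (Pipeline step, then ColumnTransformer transformer, then inner Pipeline step, then estimator parameter). A minimal sketch of that routing:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

inner = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
outer = Pipeline([("preprocess", inner)])

# Each "__" descends one level: outer step "preprocess", then inner
# step "imputer", then its "strategy" parameter
outer.set_params(preprocess__imputer__strategy="mean")
print(outer.get_params()["preprocess__imputer__strategy"])  # mean
```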
This is the final step
grid_search = GridSearchCV(pipe, param_grid, cv=cv1_splits,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
clf = grid_search.fit(X, y)
generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y,
                                       cv=cv2_splits)
generalization_error
Now, here comes the glitch (bottom two lines of the preceding code snippet).
If I follow the scikit-learn instructions (link), I should write:
generalization_error = cross_val_score(clf, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error
Unfortunately, calling cross_val_score(clf, X=X, ...) gives me an error (indices are out of bounds for the train/test splits) and the generalization error array contains only NaNs.
On the other hand, if I write:
generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error
the script runs flawlessly and the generalization error array is filled with scores. Can I stick to this last way of doing things, or is there something wrong in the whole process?
The problem here lies in the use of cv1_splits and cv2_splits rather than of the splitter objects cv1_5 and cv2_5 themselves (in particular, it is the use of cv1_splits that causes the issue).

In general, cross_val_score() calls fit() on a clone of the clf estimator; here that clone is a GridSearchCV estimator, which gets fitted on several outer training sets (subsets of X selected according to cv2_splits, and therefore smaller than X; see here for notation). Since cv1_splits was built from the full X, its indices are consistent with X's dimension, but not necessarily with the dimension of each outer training subset, hence the out-of-bounds error.
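A tiny runnable illustration of that index mismatch (toy arrays standing in for X and income_cat):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for X and income_cat (values are made up)
X = np.arange(100).reshape(50, 2)
strata = np.repeat([1, 2], 25)

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
precomputed = [(tr, te) for tr, te in splitter.split(X, strata)]

# cross_val_score would hand the cloned GridSearchCV an outer training
# subset like this one (40 of the 50 rows)...
X_inner_train = X[:40]

# ...but the precomputed indices were drawn against all 50 rows, so
# some of them point past the end of the subset
train_idx, _ = precomputed[0]
print(train_idx.max() >= len(X_inner_train))  # True -> IndexError inside fit
```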
Instead, by passing cv1_5 itself to the GridSearchCV estimator, the estimator takes care of splitting each inner training set coherently, because the splitter generates fresh indices for whatever data it is given (see here for reference).