
Nested cross validation with stratified folds

I am trying to implement a random forest regressor using scikit-learn pipelines and nested cross-validation. The dataset is about housing prices, with several features (some numeric, others categorical) and a continuous target variable (median_house_value).

Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 

I decided to manually create two sets of stratified 5-fold split indices (for the inner and outer loops of the nested CV). The stratification is based on a binned version of the median_income feature:

import numpy as np
import pandas as pd

# bin median_income into 5 income categories to stratify on
df.insert(9, "income_cat",
          pd.cut(df["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5]))
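
The resulting category proportions, which the stratified splits should preserve, can be checked with:

df["income_cat"].value_counts(normalize=True).sort_index()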

This is the code that builds the folds:

from sklearn.model_selection import StratifiedShuffleSplit

cv1_5 = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
cv1_splits = []

# collect the first 5 stratified (train, test) fold indices
for train_index, test_index in cv1_5.split(df, df["income_cat"]):
    cv1_splits.append((train_index, test_index))

cv2_5 = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=43)
cv2_splits = []

# collect the second 5 stratified (train, test) fold indices
for train_index, test_index in cv2_5.split(df, df["income_cat"]):
    cv2_splits.append((train_index, test_index))

# set initial dataset
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"].copy()
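
Each element of cv1_splits and cv2_splits is a (train_index, test_index) pair of positional index arrays computed over the full df; with 20640 rows and test_size=0.2, for example:

len(cv1_splits)                    # 5
train_index, test_index = cv1_splits[0]
train_index.shape                  # (16512,) -> 80% of the 20640 rows
test_index.shape                   # (4128,)  -> the remaining 20%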

This is the preprocessing pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# create preprocessing pipeline
preprocess_pipe = Pipeline(
    [
        ("ctransformer", ColumnTransformer([
                (
                    "num_pipe",
                    Pipeline([
                        ("imputer", SimpleImputer(strategy="median")),
                        ("scaler", StandardScaler())
                    ]),
                    list(X.select_dtypes(include=[np.number]))
                ),
                (
                    "cat_pipe",
                    Pipeline([
                        ("encoder", OneHotEncoder()),
                    ]),
                    ["ocean_proximity"]
                )
            ])
        ),
    ]
)

And this is the final pipeline (which includes the preprocessing one):

from sklearn.ensemble import RandomForestRegressor

pipe = Pipeline([
    ("preprocess", preprocess_pipe),
    ("model", RandomForestRegressor())
])

I am using nested cross-validation to tune the hyperparameters of the final pipeline and to compute the generalization error.

Here is the parameter grid:

param_grid = [
    {
        "preprocess__ctransformer__num_pipe__imputer__strategy": ["mean", "median"],
        "model__n_estimators": [3, 10, 30, 50, 100, 150, 300],
        "model__max_features": [2, 4, 6, 8]
    }
]
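
The double underscores route each setting through the nested pipeline steps; if a parameter path is uncertain, the full list of valid keys can be obtained from the pipeline itself:

sorted(pipe.get_params().keys())   # includes "preprocess__ctransformer__num_pipe__imputer__strategy"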

This is the final step:

from sklearn.model_selection import GridSearchCV, cross_val_score

grid_search = GridSearchCV(pipe, param_grid, cv=cv1_splits,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)

clf = grid_search.fit(X, y)

generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits)
generalization_error

Now, here comes the glitch (it concerns the bottom two lines of the preceding code snippet):

If I follow the scikit-learn instructions (link), I should write:

generalization_error = cross_val_score(clf, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error

Unfortunately, calling cross_val_score(clf, X=X, ...) gives me an error (the indices are out of bounds for the train/test splits), and the generalization error array contains only NaNs.

On the other hand, if I write it like so:

generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error

the script runs flawlessly and I can see the generalization error array filled with scores. Can I stick with the last way of doing things, or is there something wrong in the whole process?

For me, the problem here lies in the use of cv1_splits and cv2_splits rather than cv1_5 and cv2_5 (in particular, it is the use of cv1_splits that causes the issue).

In general, cross_val_score() calls fit() on a clone of the clf estimator; in this case the clone is a GridSearchCV estimator, which gets fitted onto several outer training sets (subsets of X selected according to cv2_splits, hence smaller than X; see here for the notation). Within each of those fits, GridSearchCV carves out its inner training sets by applying cv1_splits. Since cv1_splits was built from the full X, it contains indices which are consistent with the dimension of X, but which are not necessarily consistent with the dimension of the smaller outer training sets, hence the out-of-bounds error.
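
A minimal sketch of that mismatch (the row counts assume this dataset's 20640 rows and the 80/20 outer split):

import numpy as np

outer_train_rows = np.arange(16512)   # stands in for one outer training set (80% of 20640)
train_index = cv1_splits[0][0]        # inner indices computed against the full 20640-row X
outer_train_rows[train_index]         # IndexError: some indices exceed size 16512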

Instead, by passing the cv1_5 splitter object to the GridSearchCV estimator, the estimator itself takes care of splitting whatever training set it receives coherently (see here for reference).
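
One wrinkle worth spelling out: GridSearchCV hands its own y (the continuous target) to the splitter's split() call, so a plain StratifiedShuffleSplit would then try to stratify on house values rather than on income_cat. A minimal sketch of one way around that, using a hypothetical IncomeStratifiedSplit wrapper (not a scikit-learn class) that re-derives the strata from whatever X it receives:

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score

class IncomeStratifiedSplit:
    """Hypothetical CV splitter: stratifies on the income_cat column of
    whatever X it receives, so its indices stay valid for any row subset."""

    def __init__(self, n_splits=5, test_size=0.2, random_state=42):
        self.sss = StratifiedShuffleSplit(n_splits=n_splits,
                                          test_size=test_size,
                                          random_state=random_state)

    def split(self, X, y=None, groups=None):
        # indices are positional with respect to the X passed in,
        # not with respect to the full dataset
        return self.sss.split(X, X["income_cat"])

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.sss.get_n_splits()

grid_search = GridSearchCV(pipe, param_grid,
                           cv=IncomeStratifiedSplit(random_state=42),
                           scoring="neg_mean_squared_error",
                           return_train_score=True)

# the outer cv2_splits indices were computed over the full X, so they remain valid here
generalization_error = cross_val_score(grid_search, X=X, y=y,
                                       cv=cv2_splits,
                                       scoring="neg_mean_squared_error")

Because the wrapper derives its strata from the X it is handed, the indices it returns are always positional with respect to that (possibly subsetted) X, which is exactly the consistency the precomputed cv1_splits list lacks.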
