Standardizing data with K-fold cross-validation

Question

I'm using StratifiedKFold, so my code looks like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_model(X,y,X_test,folds,model):
    scores=[]
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
        X_train,X_valid = X[train_index],X[valid_index]
        y_train,y_valid = y[train_index],y[valid_index]
        model.fit(X_train,y_train)
        y_pred_valid = model.predict(X_valid).reshape(-1,)
        scores.append(roc_auc_score(y_valid, y_pred_valid))
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

folds = StratifiedKFold(10,shuffle=True,random_state=0)
lr = LogisticRegression(class_weight='balanced',penalty='l1',C=0.1,solver='liblinear')
# X_train, y_train and X_test are my own arrays (not shown here)
train_model(X_train,y_train,X_test,folds,lr)

Now, before training the model, I want to standardize the data. Which is the correct way?
1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Doing this before calling the train_model function.

2)
Doing the standardization inside the function, like this:

def train_model(X,y,X_test,folds,model):
    scores=[]
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
        X_train,X_valid = X[train_index],X[valid_index]
        y_train,y_valid = y[train_index],y[valid_index]
        # Fit the scaler on this fold's training part only
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_valid = scaler.transform(X_valid)
        # Use a new name so X_test is not re-scaled on every fold
        X_test_scaled = scaler.transform(X_test)
        model.fit(X_train,y_train)
        y_pred_valid = model.predict(X_valid).reshape(-1,)
        scores.append(roc_auc_score(y_valid, y_pred_valid))
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

As far as I know, the second option does not leak data. So which way is correct if I'm not using a pipeline? And how do I use a pipeline if I want to use cross-validation?

Answer 1

Indeed, the second option is better because the scaler does not see the values of X_valid when it scales X_train.
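To make the leakage concrete, here is a minimal sketch using only the standard library. It mimics what StandardScaler learns for a single feature (the mean and the population standard deviation). The helper names fit_scaler and transform and the toy numbers are made up for illustration and are not part of the question's code.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return (mean, population std) for one feature column."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with already-fitted statistics."""
    return [(v - mean) / std for v in values]


# Toy fold: four training values and one extreme validation value (hypothetical data)
train_fold = [1.0, 2.0, 3.0, 4.0]
valid_fold = [100.0]

# Option 2: statistics come from the training part of the fold only
mean_train, std_train = fit_scaler(train_fold)

# Option 1: statistics come from all rows, so the validation value influences them
mean_all, std_all = fit_scaler(train_fold + valid_fold)

scaled_train_clean = transform(train_fold, mean_train, std_train)
scaled_train_leaky = transform(train_fold, mean_all, std_all)

print('mean fitted on train fold only:', mean_train)
print('mean fitted on train + valid:  ', mean_all)
```

In the leaky variant the training values are shifted by a mean that the single validation row dominates, so information about the held-out data has reached the training step.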

Now, if you were to use a pipeline, you could do:

from sklearn.pipeline import make_pipeline

def train_model(X,y,X_test,folds,model):
    pipeline = make_pipeline(StandardScaler(), model)
    ...

Then use pipeline instead of model. On every fit call the pipeline fits the scaler on the data it is given and trains the model on the scaled result; on every predict call it applies that already-fitted scaling before predicting.
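A fuller version of that idea might look like the sketch below. It is one possible way to fill in the loop, not necessarily what the answer had in mind. The synthetic data from make_classification, the use of clone to get a fresh model per fold, the switch from predict to predict_proba for the AUC, and the simplified LogisticRegression settings are all assumptions added for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_model(X, y, folds, model):
    """Return one validation ROC AUC per fold, with scaling done inside each fold."""
    scores = []
    for train_index, valid_index in folds.split(X, y):
        # A fresh pipeline per fold: the scaler is fitted on the training part only
        pipeline = make_pipeline(StandardScaler(), clone(model))
        pipeline.fit(X[train_index], y[train_index])
        # Positive-class probabilities give a continuous score for ROC AUC
        y_score = pipeline.predict_proba(X[valid_index])[:, 1]
        scores.append(roc_auc_score(y[valid_index], y_score))
    return np.array(scores)


# Synthetic stand-in for the asker's data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
lr = LogisticRegression(C=0.1, class_weight='balanced')

scores = train_model(X, y, folds, lr)
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(scores.mean(), scores.std()))
```

Because the scaler lives inside the pipeline, there is no separate fit_transform or transform call to get wrong, and a final pipeline fitted on all training data can be applied to X_test with a single predict call.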

Note that you can also use the cross_val_score function from scikit-learn, with the parameter scoring='roc_auc'.
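As a rough sketch of that shortcut: passing the whole pipeline to cross_val_score lets scikit-learn refit the scaler and the model on each training split. The synthetic data and the simplified LogisticRegression settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaler and classifier are cross-validated together as one estimator
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.1, class_weight='balanced'),
)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One ROC AUC value per fold; the scaler never sees the held-out split during fit
cv_scores = cross_val_score(pipeline, X, y, cv=folds, scoring='roc_auc')
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(cv_scores.mean(), cv_scores.std()))
```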

Answer 2

When to standardize your data may be a question better suited for Cross Validated.

IMO, if your data set is large it probably doesn't matter too much (although with k-fold this may not hold), but since you can, it's better to standardize inside your cross-validation (k-fold) loop, i.e. option 2.

Also, see this for more information on overfitting in cross-validation.

2 answers

Answer 1
1 accepted 2019-11-19 17:34:33

Answer 2
0 2019-11-19 17:27:59
