Does cross_val_score not fit the actual input model?

Question

I am working on a project that involves a large dataset.

I need to train an SVM classifier using KFold cross-validation from Sklearn.

import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score


x__df_chunk_synth = pd.read_csv('C:/Users/anujp/Desktop/sort/semester 4/ATML/Sem project/atml_proj/Data/x_train_syn.csv')
y_df_chunk_synth = pd.read_csv('C:/Users/anujp/Desktop/sort/semester 4/ATML/Sem project/atml_proj/Data/y_train_syn.csv')

svm_clf = svm.SVC(kernel='poly', gamma=1, class_weight=None, max_iter=20000, C=100, tol=1e-5)
X = x__df_chunk_synth
Y = y_df_chunk_synth
scores = cross_val_score(svm_clf, X, Y, cv=5, scoring='f1_weighted')
print(scores)

pred = svm_clf.predict(chunk_test_x)
accuracy = accuracy_score(chunk_test_y, pred)

print(accuracy)

I am using the code above. I understand that the classifier is trained inside cross_val_score, so whenever I call the classifier afterwards to predict on the test data, I get an error:

sklearn.exceptions.NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Is there another way to do this correctly?

Please help me with this issue.

Answer 1

Indeed, model_selection.cross_val_score uses the input model to fit the data, so the model does not have to be fitted beforehand. However, it does not fit the actual object passed as input but a copy of it, hence the error This SVC instance is not fitted yet... when trying to predict.
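The behaviour described above can be reproduced with a minimal sketch. The data here is synthetic (make_classification stands in for the CSV files from the question, which are not available), and the SVC parameters are illustrative assumptions rather than the asker's settings. It shows that the estimator passed to cross_val_score is still unfitted afterwards, and that one possible way forward is to call fit on it explicitly before predicting:

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the dataset from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)

# cross_val_score fits clones of clf, one per fold, and returns only the scores.
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1_weighted")

# The clf object itself has not been fitted, so predict() raises NotFittedError.
try:
    clf.predict(X_test)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False

# Fitting clf explicitly on the training data makes predict() usable.
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(scores)
print(fitted_after_cv)
```

In this pattern the cross-validation scores serve as an estimate of how the model generalises, while the model used for prediction is trained separately on all of the training data.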

Looking into the source code of cross_validate, which is called by cross_val_score, the estimator goes through clone first in the scoring step:

scores = parallel(
    delayed(_fit_and_score)(
        clone(estimator), X, y, scorers, train, test, verbose, None,
        fit_params, return_train_score=return_train_score,
        return_times=True, return_estimator=return_estimator,
        error_score=error_score)
    for train, test in cv.split(X, y, groups))

clone creates a new, unfitted estimator with a deep copy of the model's parameters, which is why the actual input model is never fitted:

def clone(estimator, *, safe=True):
    """Constructs a new estimator with the same parameters.
    Clone does a deep copy of the model in an estimator
    without actually copying attached data. It yields a new estimator
    with the same parameters that has not been fit on any data.
    ...
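As a rough illustration of what clone does, the sketch below clones an already fitted SVC and checks that the copy keeps the parameters but loses the fitted state. It also shows one option for getting hold of the per-fold models: cross_validate accepts return_estimator=True (the same flag that appears in the source excerpt above) and then returns the fitted clones under the "estimator" key. The data is synthetic and the parameter values are assumptions chosen only for the example:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the dataset from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit a model, then clone it.
fitted = SVC(kernel="rbf", C=10.0).fit(X, y)
fresh_copy = clone(fitted)

# The clone has the same constructor parameters...
same_params = fresh_copy.get_params() == fitted.get_params()
# ...but none of the attributes learned during fit (support_ is set by SVC.fit).
copy_is_fitted = hasattr(fresh_copy, "support_")

# return_estimator=True keeps the clone fitted on each fold's training split.
cv_results = cross_validate(
    SVC(kernel="rbf", C=10.0), X, y,
    cv=5, scoring="f1_weighted", return_estimator=True,
)
fold_models = cv_results["estimator"]

# Each fold model is fitted and can predict.
fold_preds = fold_models[0].predict(X[:5])

print(same_params, copy_is_fitted, len(fold_models))
```

Each of these fold models has seen only part of the training data, so for a final model it is usually simpler to fit a single estimator on the whole training set once cross-validation has been used to assess it.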

Answer 1 (accepted, 2020-07-04 21:20:24)
