Sklearn cross_val_score gives a significantly different number than model.score?

I have a binary classification problem.

First I train/test split my data as:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

I checked y_train and it has basically a 50/50 split of the two classes (1, 0), which matches the dataset as a whole.

When I try a base model such as:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_train, y_train)  # accuracy on the same rows used for fitting

the output is 0.98, or within about 1% of that depending on the random state of the train/test split.

However, when I try cross_val_score such as:

from sklearn.model_selection import cross_val_score, StratifiedKFold
cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy')

the output is:

array([0.65      , 0.78333333, 0.78333333, 0.66666667, 0.76666667])

None of the scores in the array are even close to 0.98?

And when I tried scoring='r2' I got:

>>>cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='r2')
array([-0.20133482, -0.00111235, -0.2       , -0.2       , -0.13333333])

Does anyone know why this is happening? I have tried shuffle=True and shuffle=False but it doesn't help.

Thanks in advance.

In your base model, you compute the score on the training set itself. That is a reasonable way to check that the model has actually learnt from the data you fed it, but it says nothing about the accuracy of the model on new, unseen data. A random forest with default settings grows deep trees that can nearly memorize their training rows, so a training score close to 0.98 is expected.

I'm not 100% sure (I don't know scikit-learn well), but I'd expect cross_val_score to split X_train and y_train into a training part and a testing part for each fold.
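To make that splitting concrete, here is a minimal sketch of roughly what cross_val_score does, written out by hand. The data is a synthetic stand-in generated with make_classification (an assumption, since the original X and y are not shown), and the fixed random_state values are only there so both runs use the same folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the asker's training data (assumption: 300 rows, 2 classes).
X_train, y_train = make_classification(n_samples=300, n_features=20, random_state=0)

model = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for fit_idx, held_out_idx in cv.split(X_train, y_train):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_train[fit_idx], y_train[fit_idx])
    # Score only on the rows this copy never saw during fit.
    manual_scores.append(
        fold_model.score(X_train[held_out_idx], y_train[held_out_idx])
    )
manual_scores = np.array(manual_scores)

# The library call, using the same folds and the same estimator settings.
library_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

print("manual: ", manual_scores)
print("library:", library_scores)
```

Every number in the array therefore comes from held-out rows, which is why none of them should be compared with model.score(X_train, y_train).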

Since that score is computed on data unseen during training, the accuracy will be much lower. Try comparing these results with model.score(X_test, y_test); it should be much closer.
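A short sketch of that comparison, again on synthetic data rather than the asker's dataset: the flip_y label noise and the other make_classification settings are assumptions chosen only to make the gap between training and held-out accuracy visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Deliberately noisy stand-in data (assumption, not the asker's dataset).
X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # rows the forest was fitted on
test_acc = model.score(X_test, y_test)     # rows held out by train_test_split

# cross_val_score refits clones of the model on each fold of X_train.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()

print(f"train accuracy:   {train_acc:.3f}")
print(f"mean CV accuracy: {cv_acc:.3f}")
print(f"test accuracy:    {test_acc:.3f}")
```

The training accuracy is the optimistic number; the cross-validated mean and the test accuracy are both estimates of performance on unseen rows, so those are the two to compare.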
