Sklearn cross_val_score gives a significantly different number than model.score?
I have a binary classification problem.
First I split my data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
I checked y_train and it had basically a 50/50 split of the two classes (1, 0), which is how the dataset is.
When I try a base model such as:
model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_train, y_train)
the output is 0.98, or something within 1% of that depending on the random state of the train/test split.
HOWEVER, when I try cross_val_score such as:
cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy')
the output is
array([0.65 , 0.78333333, 0.78333333, 0.66666667, 0.76666667])
None of the scores in the array are even close to 0.98.
And when I tried scoring='r2' I got:
>>>cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='r2')
array([-0.20133482, -0.00111235, -0.2 , -0.2 , -0.13333333])
Does anyone know why this is happening? I have tried shuffle=True and shuffle=False but it doesn't help.
Thanks in advance
In your base model, you compute your score on the training set. While this is a reasonable way to check that your model has actually learnt from the data you fed it, it doesn't tell you how accurate the model will be on new, unseen data.
I'm not 100% sure (I don't know scikit-learn well), but I'd expect cross_val_score to actually split X_train and y_train into a training and a testing set.
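To make that splitting concrete, here is a rough sketch of what cross_val_score appears to do for each fold: fit a fresh copy of the model on part of X_train and score it on the part that was held out. The data below is a synthetic stand-in generated with make_classification, since the original dataset is not shown, and the fixed random_state values are assumptions added only so the two results can be compared.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: the real data is not shown).
X_train, y_train = make_classification(
    n_samples=300, n_features=20, flip_y=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: for each fold, fit on four fifths and score on the held-out fifth.
manual_scores = []
for fit_idx, held_out_idx in cv.split(X_train, y_train):
    fold_model = clone(model)  # fresh, unfitted copy with the same parameters
    fold_model.fit(X_train[fit_idx], y_train[fit_idx])
    manual_scores.append(
        fold_model.score(X_train[held_out_idx], y_train[held_out_idx])
    )
manual_scores = np.array(manual_scores)

# Library version using the same splitter and the same model settings.
library_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

print(manual_scores)
print(library_scores)
```

With the seeds fixed as above, the two arrays should agree, which shows that every number cross_val_score reports is an accuracy on rows the fitted model never saw.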
So, as you compute a score on data unseen during training, the accuracy will be much lower. Try comparing these results with model.score(X_test, y_test); it should be much closer.
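A minimal sketch of that comparison, assuming a synthetic noisy dataset in place of the real one (the 400-sample size, the flip_y noise level, and the random_state values are all assumptions chosen for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, deliberately noisy binary data (assumption: stands in for the real X, y).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Score on the rows the forest was fitted on: usually optimistic.
train_acc = model.score(X_train, y_train)

# Score on rows held out by train_test_split: an estimate for unseen data.
test_acc = model.score(X_test, y_test)

# Cross-validated accuracy on X_train: each fold is scored on held-out rows.
cv_scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
print("CV accuracy:   ", cv_scores.mean())
```

On data like this, the training accuracy typically sits well above both the test accuracy and the cross-validated mean, while those two tend to be close to each other. As a side note, 'r2' is a regression metric; for 0/1 labels with a roughly 50/50 split it works out to about 1 - 4 * (error rate), so the negative values in the question are consistent with accuracies around 0.7 rather than a sign of a separate problem.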