Unexpected cross-validation scores with scikit-learn LinearRegression
How to compute cross-validation scores correctly in scikit-learn?
I am working on a classification task. However, my approaches give slightly different results:
#First Approach
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=False)
pipe= make_pipeline(SVC())
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))
#Second Approach
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print ('Precision:', precision_score(y_test, y_pred,average='binary'))
#Third approach
pipe = make_pipeline(SVC())
print('Precision',np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))
#Fourth approach
pipe= make_pipeline(SVC())
print('Precision',np.mean(cross_val_score(pipe, X_train, y_train, cv=kf, scoring='precision')))
Output:
Precision: 0.780422106837
Precision: 0.782051282051
Precision: 0.801544091998
/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
1431 train, test, verbose, None,
1432 fit_params)
-> 1433 for train, test in cv)
1434 return np.array(scores)[:, 0]
1435
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
798 # was dispatched. In particular this covers the edge
799 # case of Parallel used with an exhausted iterator.
--> 800 while self.dispatch_one_batch(iterator):
801 self._iterating = True
802 else:
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
656 return False
657 else:
--> 658 self._dispatch(tasks)
659 return True
660
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
564
565 if self._pool is None:
--> 566 job = ImmediateComputeBatch(batch)
567 self._jobs.append(job)
568 self.n_dispatched_batches += 1
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
178 # Don't delay the application, to avoid keeping the input
179 # arguments in memory
--> 180 self.results = batch()
181
182 def get(self):
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1522 start_time = time.time()
1523
-> 1524 X_train, y_train = _safe_split(estimator, X, y, train)
1525 X_test, y_test = _safe_split(estimator, X, y, test, train)
1526
/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _safe_split(estimator, X, y, indices, train_indices)
1589 X_subset = X[np.ix_(indices, train_indices)]
1590 else:
-> 1591 X_subset = safe_indexing(X, indices)
1592
1593 if y is not None:
/usr/local/lib/python3.5/site-packages/sklearn/utils/__init__.py in safe_indexing(X, indices)
161 indices.dtype.kind == 'i'):
162 # This is often substantially faster than X[indices]
--> 163 return X.take(indices, axis=0)
164 else:
165 return X[indices]
IndexError: index 900 is out of bounds for size 900
So, my question is: which of the approaches above is correct for computing cross-validated metrics? I believe my scores are contaminated because I am confused about when to perform cross-validation. So, any ideas on how to compute cross-validation scores correctly?
UPDATE
Should the evaluation be done on the training step, like this?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = False)
clf = make_pipeline(SVC())
# However, for clf, you can use whatever estimator you like
kf = StratifiedKFold(y = y_train, n_folds=10, shuffle=True, random_state=False)
scores = cross_val_score(clf, X_train, y_train, cv = kf, scoring='precision')
print('Mean score : ', np.mean(scores))
print('Score variance : ', np.var(scores))
For any classification task, it is always good to use a StratifiedKFold cross-validation split. A StratifiedKFold keeps the proportion of samples from each class the same in every fold as in your whole dataset.
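To illustrate the idea, here is a minimal pure-Python sketch of stratified splitting (hypothetical helper, not scikit-learn's implementation): each class's indices are dealt round-robin across folds, so every fold mirrors the overall class mix.

```python
from collections import defaultdict

def stratified_folds(y, n_folds):
    """Deal each class's indices round-robin across folds, so every
    fold mirrors the overall class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % n_folds].append(idx)
    return folds

y = [0] * 60 + [1] * 40                # 60/40 class mix
for fold in stratified_folds(y, 5):
    print(len(fold), sum(y[i] for i in fold))   # 20 8  -- every fold
```

Every fold has 20 samples, of which 8 are positive, matching the 60/40 mix of the full label set.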
Then it depends on the kind of classification problem you have. It is always nice to look at the precision and recall scores. In the case of a skewed binary classification, people tend to use the ROC AUC score:
from sklearn import metrics
metrics.roc_auc_score(ytest, ypred)
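As a side note, ROC AUC can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. A minimal pure-Python sketch of that definition (not scikit-learn's implementation, which works from the ROC curve):

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Of the four (positive, negative) score pairs, three are ranked correctly, hence 0.75.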
Let's look at your solutions:
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import precision_score
from sklearn.cross_validation import KFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
np.random.seed(1337)
X = np.random.rand(1000,5)
y = np.random.randint(0,2,1000)
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=42)
pipe= make_pipeline(SVC(random_state=42))
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))
# Here you are evaluating the precision score on X_train.
#Second Approach
clf = SVC(random_state=42)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print ('Precision:', precision_score(y_test, y_pred, average='binary'))
# here you are evaluating precision score on X_test
#Third approach
pipe= make_pipeline(SVC())
print('Precision',np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))
# Here you are splitting the data again and evaluating mean on each fold
Hence, the results are different.
First of all, as explained in the documentation and shown in some examples, scikit-learn's cross-validation cross_val_score does the following:

1. It splits X into N folds (according to the cv parameter), and splits the labels y accordingly.
2. For each fold, it trains your estimator (the estimator parameter) on the N-1 other folds.
3. It computes the score (the scoring parameter) on the held-out fold.

Let's look at each of your methods.
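The steps above can be sketched in plain Python (hypothetical helper names and a toy majority-class "estimator"; the real logic lives inside scikit-learn):

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs: each fold is held out once."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_val_scores(fit_score, X, y, n_folds=10):
    """For each fold, train on the other N-1 folds and score on the held-out fold."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(y), n_folds):
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        X_te = [X[i] for i in test_idx]
        y_te = [y[i] for i in test_idx]
        scores.append(fit_score(X_tr, y_tr, X_te, y_te))
    return scores

def majority_accuracy(X_tr, y_tr, X_te, y_te):
    """Toy 'estimator': predict the majority class seen in training."""
    pred = max(set(y_tr), key=y_tr.count)
    return sum(t == pred for t in y_te) / len(y_te)

X = list(range(12))
y = [0, 1] * 6
print(cross_val_scores(majority_accuracy, X, y, n_folds=3))  # [0.5, 0.5, 0.5]
```

Note that the model is re-fit from scratch on every fold; with perfectly balanced labels, the majority-class baseline scores 0.5 on each held-out fold.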
First approach:
Why do you split the training set before cross-validation, when the scikit-learn function does that for you? As a result, you train your model on less data and end up with a validation score computed only within the training subset.
Second approach:
Here you use a different evaluation on your data than cross_val_score. Hence you cannot compare it with the other validation scores, because they are two different things. One is a classic error percentage, whereas precision is a metric used to calibrate binary classifiers (true or false). It is a good metric (you can check the ROC curve, and the precision and recall metrics), but only compare those metrics with each other.
Third approach:
This one is the more natural one. This score is the right one (I mean, if you want to compare it with other classifiers/estimators). However, I would warn you against taking the mean directly: there are two things you can compare, the mean and also the variance. Each score in the array differs from the others, and you may want to know by how much compared with other estimators (you definitely want your variance to be as small as possible).
Fourth approach:
There seems to be an error with the Kfold that is not about cross_val_score itself: cv=kf was built on the full dataset, but you then score on X_train, which has fewer samples, so fold indices fall out of bounds and raise the IndexError above.
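A minimal reproduction of the mismatch with plain lists (numpy raises the analogous "index 900 is out of bounds for size 900"; the sizes here are illustrative):

```python
# Fold indices generated for the full dataset are applied to a
# smaller training subset, so some indices no longer exist.
data = list(range(900))        # stands in for X_train: 900 rows
indices = list(range(1000))    # fold indices built from all 1000 samples
try:
    subset = [data[i] for i in indices]
except IndexError as e:
    print(e)                   # list index out of range
```

Building kf on X_train itself (or passing the full X and y to cross_val_score) removes the mismatch.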
Finally:
Use only the second or the third approach to compare estimators. But they definitely do not estimate the same thing: precision versus error rate.
clf = make_pipeline(SVC())
# However, for clf, you can use whatever estimator you like
scores = cross_val_score(clf, X, y, cv = 10, scoring='precision')
print('Mean score : ', np.mean(scores))
print('Score variance : ', np.var(scores))
By changing clf to another estimator (or embedding this in a loop), you can get a score for each estimator and compare them.