
Cross validation and pipeline in scikit-learn

For a machine learning project, I'm trying to predict a categorical outcome variable using features extracted from text.

Using scikit-learn's cross-validation utilities (train_test_split), I split my X and Y into a training set and a test set. The training set is fitted with a pipeline. However, when I compute the performance using X from my test set, the accuracy is 0.0. At that point, no features have been extracted from X_test yet.

Is it possible to split the dataset within the pipeline?

My code:

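from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.18
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# read_data is defined elsewhere (not shown); it returns the texts X and the targets Y
X, Y = read_data('development2.csv')

# hold out a third of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

train_pipeline = Pipeline([
    ('vect', CountVectorizer()),  # ngram_range=(1,2), analyzer='word'
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True))),
])

train_pipeline.fit(X_train, Y_train)

# predict applies the fitted vect and tfidf steps to X_test first
predicted = train_pipeline.predict(X_test)

print(accuracy_score(Y_test, predicted))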

The traceback when using SVC:

  File "/Users/Robbert/Documents/pipeline.py", line 62, in <module>
    train_pipeline.fit(X_train, Y_train)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
    y = self._validate_targets(y)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
    y_ = column_or_1d(y, warn=True)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)

I solved the problem.

The target variable (Y) did not have the appropriate format. It was stored as an indicator matrix with one row per sample, like this: [[0 0 0 0 1],[0 0 1 0 0]]. SVC expects a one-dimensional array of class labels, which is why fit raised "bad input shape (670, 5)". I converted Y to a flat array with one label per sample, like this: [5, 3].
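The conversion described above can be sketched roughly as follows. This is only an illustration, not the code actually used: indicator_to_labels is a hypothetical helper (it is not part of scikit-learn), and numbering the classes from 1 is an assumption made so that the output matches the [5, 3] example.

```python
def indicator_to_labels(rows, first_label=1):
    """Turn one-hot rows such as [0, 0, 1, 0, 0] into flat class labels.

    Hypothetical helper, not part of scikit-learn. first_label=1 is an
    assumption chosen so that the output matches the [5, 3] example.
    """
    labels = []
    for row in rows:
        row = list(row)  # also accepts NumPy rows
        # every row must hold a single 1 and zeros everywhere else
        if row.count(1) != 1 or row.count(0) != len(row) - 1:
            raise ValueError("expected exactly one 1 per row, got %r" % (row,))
        # position of the 1, shifted so that labels start at first_label
        labels.append(row.index(1) + first_label)
    return labels


Y = [[0, 0, 0, 0, 1], [0, 0, 1, 0, 0]]
labels = indicator_to_labels(Y)
print(labels)  # [5, 3]
```

If Y is already a NumPy array of one-hot rows, Y.argmax(axis=1) + 1 should give the same labels in one step.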

This did the trick for me.

Thanks for all the answers.
