feature_names mismatch in xgboost (sklearn API) during multiclass text classification

Question

I am trying to perform multiclass text classification using xgboost in Python (the sklearn API), but it sometimes fails with an error reporting a mismatch in feature names. The odd thing is that it does work some of the time (perhaps 1 run in 4), but this unpredictability makes it hard for me to rely on the approach for now, even though it shows encouraging results without any real pre-processing.

I have included some illustrative sample data in the code, similar to what I would actually be using. The code I currently have is as follows:

Updated code that reflects maxymoo's suggestion:

import xgboost as xgb
import numpy as np
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(31337)    

y = np.array([0, 1, 2, 1, 0, 3, 1, 2, 3, 0])
X = np.array(['milk honey bear bear honey tigger',
          'tom jerry cartoon mouse cat cat WB',
          'peppa pig mommy daddy george peppa pig pig',
          'cartoon jerry tom silly',
          'bear honey hundred year woods',
          'ben holly elves fairies gaston fairy fairies castle king',
          'tom and jerry mouse WB',
          'peppa pig daddy pig rebecca rabit',
          'elves ben holly little kingdom king big people',
          'pot pot pot pot jar winnie pooh disney tigger bear'])

xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())

kf = KFold(y.shape[0], n_folds=2, shuffle=True, random_state=rng)
for train_index, test_index in kf:
    xgb_model.fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print accuracy

The error I tend to get is as follows:

Traceback (most recent call last):
  File "main.py", line 95, in <module>
    predictions = xgb_model.predict(X[test_index])
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 465, in predict
    ntree_limit=ntree_limit)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 939, in predict
    self._validate_features(data)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1179, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24']
expected f26, f25 in input data

Any pointers would be really appreciated!

Answer 1

You need to make sure that you only score the model with the features it was trained on. The usual way to do this is to use a Pipeline to package the vectoriser and the model together. That way both are fitted on the same training data, and if a new feature is encountered in the test data, the vectoriser will just ignore it. Also note that you don't need to recreate the model at each stage of the cross-validation; you just initialise it once and then refit it at each fold:

from sklearn.pipeline import make_pipeline    

xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())
for train_index, test_index in kf:
    xgb_model.fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print accuracy
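
To make the reasoning above concrete, here is a minimal sketch of why the column counts can diverge between folds. It uses a toy bag-of-words helper written in plain Python, not scikit-learn's CountVectorizer; the names fit_vocabulary and transform are hypothetical and exist only for this illustration. It assumes the mismatch comes from building a vocabulary separately on the training and test splits, which is one way the error in the question can arise.

```python
# Toy stand-in for a bag-of-words vectoriser (NOT scikit-learn's implementation).
# fit_vocabulary and transform are hypothetical helper names for illustration.

def fit_vocabulary(docs):
    """Map each distinct lower-cased token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens absent from the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            col = vocabulary.get(tok)
            if col is not None:
                row[col] += 1
        rows.append(row)
    return rows

train_docs = ['tom jerry cartoon mouse cat cat WB',
              'bear honey hundred year woods']
test_docs = ['cartoon jerry tom silly',
             'peppa pig daddy pig rebecca rabit']

# Mistake: a separate vocabulary per split yields matrices of different widths.
train_vocab = fit_vocabulary(train_docs)
test_vocab = fit_vocabulary(test_docs)
mismatched_widths = (len(train_vocab), len(test_vocab))
print(mismatched_widths)  # (11, 9)

# Fix: reuse the training vocabulary, so the test matrix has the same columns.
X_train = transform(train_docs, train_vocab)
X_test = transform(test_docs, train_vocab)
```

Here 'silly' and every word of the second test document are unseen during fitting, so they contribute nothing to X_test, but each test row still has 11 columns. Fitting the vectoriser and the classifier together in one pipeline gives the same guarantee without managing the vocabulary by hand.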

Answer 1
0 2016-08-19 04:00:43