I am using XGBoost in Python to perform text classification.
Below is the training set I am considering:
itemid | description | category
11802974 | SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters | Architectural Diffusers
10688548 | ANTIQUE BRONZE FINISH PUSHBUTTON switch | Door Bell Pushbuttons
9836436 | Descente pour Cable tray fitting and accessories | Tray Cable Drop Outs
I am constructing a document-term matrix from the description column using scikit-learn's CountVectorizer, which generates a SciPy sparse matrix (I have a large dataset of 1.1 million rows, so I use a sparse representation to save memory), with the code below:
countvec = CountVectorizer()
documenttermmatrix=countvec.fit_transform(trainset['description'])
After that, I apply feature selection to the above matrix using:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train= fs.fit_transform(documenttermmatrix,y1_train)
I am using the XGBoost classifier to train the model:
model = XGBClassifier(silent=False)
model.fit(documenttermmatrix_train, y_train,verbose=True)
Below is the test set I am considering:
itemid | description | category
9836442 | TRIPLE Space heaters | Architectural Diffusers
13863918 | pushbutton switch | Door Bell Pushbuttons
I am constructing a separate matrix for the test set, as I did for the training set, using the code below:
documenttermmatrix_test=countvec.fit_transform(testset['description'])
When predicting on the test set, XGBoost expects all the features of the training set to be present in the test set, but that is not the case here: the test matrix only has columns for the words that occur in the test descriptions.
I cannot combine the training and test sets into a single dataset, because I need to do feature selection on the training set only.
Can anyone tell me how I should proceed?
Instead of using countvec.fit_transform() on the test set, only use transform().
Change this line:
documenttermmatrix_test=countvec.fit_transform(testset['description'])
To this:
documenttermmatrix_test=countvec.transform(testset['description'])
This makes sure that only the features learned from the training set are taken from the test set; if a training feature does not occur in a test document, its count is 0.
fit_transform() discards the vocabulary learned from the training data and builds a new matrix, which can have different features than the previous output. Hence the error.
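The fix above, combined with the feature selection step from the question, can be sketched as follows. This is a minimal illustration on toy data adapted from the question; the shortened labels and the variable names (train_docs, X_train_sel, and so on) are assumptions made for the example, and the XGBoost training step is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Toy data adapted from the question; the labels are shortened placeholders.
train_docs = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON switch",
    "Descente pour Cable tray fitting and accessories",
]
train_labels = ["diffusers", "pushbuttons", "tray"]
test_docs = ["TRIPLE Space heaters", "pushbutton switch"]

# Learn the vocabulary on the training set only.
countvec = CountVectorizer()
X_train = countvec.fit_transform(train_docs)

# Reuse the learned vocabulary: same columns, zeros for absent words.
X_test = countvec.transform(test_docs)

# Fit the feature selector on the training set only, then apply it to both.
fs = SelectPercentile(chi2, percentile=40)
X_train_sel = fs.fit_transform(X_train, train_labels)
X_test_sel = fs.transform(X_test)

print(X_train.shape, X_test.shape)
print(X_train_sel.shape, X_test_sel.shape)
```

Both pairs of matrices end up with the same number of columns, which is what the classifier needs at prediction time. The same fit-on-train, transform-on-test pattern applies to the selector as well as the vectorizer.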
You have to use fit_transform on the training set, but only transform on your test set. Also, the default output of CountVectorizer is a CSR matrix. It doesn't work with XGBClassifier; you have to convert it to a CSC matrix. Simply do: X = csc_matrix(X).
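A minimal sketch of that conversion, using a small hand-made matrix in place of the CountVectorizer output (the names X_csr and X_csc are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# Stand-in for the CSR matrix that CountVectorizer returns.
X_csr = csr_matrix(np.array([[1, 0, 2],
                             [0, 0, 3]]))

# Convert to compressed sparse column format; the values are unchanged.
X_csc = csc_matrix(X_csr)

print(X_csr.format, X_csc.format)
```

Calling X_csr.tocsc() is an equivalent way to do the same conversion.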
There is no easy way around this issue, common as it is. XGBoost and other tree-based models can handle test sets with more variables than the training set (since they can ignore the extras), but never fewer (since they expect to make decisions on them). That being the case, you have some options, in descending order of desirability and likelihood of solving your problem:
1. Don't use a sparse matrix. Unless you're building this model inside a real-time application or an otherwise prohibitive production environment, the easiest thing to do is use an ordinary matrix that will keep columns of zeros.
2. Look at how you're partitioning your data. It may be that there are only one or two factors with an unbalanced split, in which case you might be able to get more equal representation by playing around with scikit-learn's train_test_split() function.
3. Prune the data yourself. Similar to option 2, if you think a couple of entries are the culprits, and that their removal wouldn't hurt your model, you can try removing them from the original dataset. This is, of course, the least desirable option, but if they really are that few and far between, they won't affect the predictive power of your model.
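One way to read option 2 is to ask train_test_split() for a stratified split, so that each category is represented in the same proportion on both sides. The sketch below uses made-up data; the category names "A" and "B", the 80/20 split, and the random_state value are assumptions for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up descriptions with two equally sized categories.
descriptions = [f"item {i}" for i in range(20)]
categories = ["A"] * 10 + ["B"] * 10

# stratify keeps the category proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    descriptions,
    categories,
    test_size=0.2,
    stratify=categories,
    random_state=0,
)

print(Counter(y_tr))
print(Counter(y_te))
```

Stratifying balances the categories, not the vocabulary, so the test set can still contain words the training set never saw.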
But broadly this is a sign of an unhealthy dataset. I would also advise looking at other ways you might bin or categorize your data into fewer groups so that this isn't a problem.