
Error due to different number of features in test and train sets after TF-IDF transform

I am trying to create an AI that reads my dataset and states whether an input outside the data is 1 or 0.

My dataset has a column for qualitative data and a column for a boolean. Here is a sample from it:

[screenshot: sample of the dataset]

Imports:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # used below
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import re
import string

Open and cleaning dataset:

saisei_data = saisei_data.dropna(how='any', axis=0)
saisei_data = saisei_data.sample(frac=1)  # shuffle the rows
X = saisei_data['Data']
y = saisei_data['Conscious']
X_train, X_test, y_train, y_test = train_test_split(X, y)  # split parameters not shown in the question
saisei_data

Vectorisation:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(X_train)
xv_test = vectorization.fit_transform(X_test)

Example Algorithm - Logistic Regression:

LR = LogisticRegression()
LR.fit(xv_train,y_train)
pred_lr=LR.predict(xv_test) # Here is where I get an error

Everything works fine until I predict using the logistic regression algorithm.

The Error:

ValueError: X has 112 features per sample; expecting 23

This seems to change to similar errors such as:

ValueError: X has 92 features per sample; expecting 45

I am new to machine learning, so I don't really know what I'm doing when it comes to using the algorithms. However, I tried printing the xv_test variable; here is a sample of the output (which also changes often):

[screenshot: sample of the xv_test output]

Any ideas?

That is because you erroneously apply .fit_transform() to your test data; and, in this case, you are lucky enough that the process produces a programming error, thus alerting you that you are doing something methodologically wrong (which is not always the case).

We never apply either .fit() or .fit_transform() to unseen (test) data. The fitting should be done only once with the training data, like you have done here:

xv_train = vectorization.fit_transform(X_train)

For subsequent transformations of unseen (test) data, we use only .transform(). So, your next line should be:

xv_test = vectorization.transform(X_test)

That way, the features in the test set will be the same as the ones in the training set, as they should have been in the first place.
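Here is a minimal runnable sketch of this point, on a toy corpus (not your dataset): fitting once on the training documents and only transforming the test documents guarantees both matrices have the same number of columns, and tokens unseen during training are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog ran", "a cat and a dog"]
test_docs = ["the cat ran fast"]  # "fast" was never seen during training

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(train_docs)  # learns the vocabulary from train data only
xv_test = vectorization.transform(test_docs)        # reuses that vocabulary; "fast" is dropped

print(xv_train.shape[1] == xv_test.shape[1])  # → True: identical feature count
```

Had the last line used fit_transform, the test matrix would have been built from a fresh vocabulary of the test documents, producing a different (and here smaller) column count, which is exactly the shape mismatch your classifier complains about.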

Notice the difference between the two methods in the docs (emphasis mine):

fit_transform :

Learn vocabulary and idf, return document-term matrix.

transform :

Transform documents to document-term matrix.

Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

and recall that we don't ever use the test set to learn anything.

So, simple general mnemonic rule, applicable practically everywhere:

The terms "fit" and "test data" are always (always...) incompatible; mixing them will create havoc.
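One practical way to make this rule hard to violate is to wrap the vectorizer and the classifier in a scikit-learn Pipeline: .fit() then runs fit_transform internally on training data only, and .predict() runs plain transform on new data. A sketch on made-up toy data (the labels and strings are illustrative, not your dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the question's X_train / y_train
X_train = ["good recovery", "no response", "eyes open", "fully aware"]
y_train = [1, 0, 1, 1]

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(X_train, y_train)            # fit_transform happens on training data only
pred = pipe.predict(["no response"])  # transform (never fit) is applied to unseen data
print(pred)
```

With this structure there is no separate xv_train/xv_test bookkeeping, so the fit-on-test mistake cannot happen by accident.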
