I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier
What I have tried now is the following:
def test_tfidf(data, ngrams = 1):
df_temp = data.copy(deep = True)
df_temp = basic_preprocessing(df_temp)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
tfidf_vectorizer.fit(df_temp['Text'])
list_corpus = df_temp["Text"].tolist()
list_labels = df_temp["Label"].tolist()
X = tfidf_vectorizer.transform(list_corpus)
return X, list_labels
(I would suggest to refer to the link I mentioned above for all the code). When I try to apply the latter two function to my dataset:
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)
I get this error:
---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True)
---> 14 y_pred = clf.predict(X_test_naive)
ValueError: dimension mismatch
The function mentioned in the error is:
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
clf = MultinomialNB()
clf.fit(X_train_naive, y_train_naive)
y_pred = clf.predict(X_test_naive)
return
Any help in understanding what is wrong in my new definition and/or in applying the tf-idf to my dataset (please refer here for the relevant parts: Encoding text in ML classifier ), it would be appreciated.
Update: I think this question/answer might be useful as well for helping me in figure out the issue: scikit-learn ValueError: dimension mismatch
if I replace test_x, test_y = test_tfidf(testing_set, ngrams = 1)
with test_x, test_y = test_tfidf(undersample_train, ngrams = 1)
it does not return any error. However, I do not think it is right, as I am getting values very very high (99% on all statistics)
When using transformes ( TfidfVectorizer
in this case), you must use the same object ot transform both train and test data. The transformer is typically fitted using the training data only, and then re-used to transform the test data.
The correct way to do this in your case:
def tfidf(data, ngrams = 1):
df_temp = data.copy(deep = True)
df_temp = basic_preprocessing(df_temp)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
tfidf_vectorizer.fit(df_temp['Text'])
list_corpus = df_temp["Text"].tolist()
list_labels = df_temp["Label"].tolist()
X = tfidf_vectorizer.transform(list_corpus)
return X, list_labels, tfidf_vectorizer
def test_tfidf(data, vectorizer, ngrams = 1):
df_temp = data.copy(deep = True)
df_temp = basic_preprocessing(df_temp)
# No need to create a new TfidfVectorizer here!
list_corpus = df_temp["Text"].tolist()
list_labels = df_temp["Label"].tolist()
X = vectorizer.transform(list_corpus)
return X, list_labels
# this method is copied from the other SO question
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
clf = MultinomialNB() # Gaussian Naive Bayes
clf.fit(X_train_naive, y_train_naive)
res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
y_pred = clf.predict(X_test_naive)
f1 = f1_score(y_pred, y_test_naive, average = 'weighted')
pres = precision_score(y_pred, y_test_naive, average = 'weighted')
rec = recall_score(y_pred, y_test_naive, average = 'weighted')
acc = accuracy_score(y_pred, y_test_naive)
res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)
return res
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, count_vectorizer), ignore_index = True)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.