
After training a Naive Bayes text classification algorithm, how do I predict the topic of a single text file?

I have trained and tested a Naive Bayes classifier using separate training and test data sets. Now I want to predict the topic of a single text file.

Here is my code:

# Load the training and test data
import sklearn.datasets as skd

categories = ['business', 'entertainment', 'local', 'sports', 'world']
# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
sinhala_train = skd.load_files(r'Cleant data\stemmed_filtered_sinhala-set1', categories=categories, encoding='utf-8')
sinhala_test = skd.load_files(r'Cleant data\stemmed_filtered_sinhala-set2', categories=categories, encoding='utf-8')

# Read the single file whose topic should be predicted
name_file = "adaderana_67571.txt"
with open(name_file, encoding='utf-8') as A:
    new_file = A.read()

from sklearn.feature_extraction.text import CountVectorizer
count_vectorization = CountVectorizer()
train_data_tf = count_vectorization.fit_transform(sinhala_train.data)
train_data_tf.shape

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()
train_data_tfidf = tfidf_trans.fit_transform(train_data_tf)
train_data_tfidf.shape

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_data_tfidf, sinhala_train.target)

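test_data_tf = count_vectorization.transform(sinhala_test.data)
# Use transform, not fit_transform, so the test set is weighted with the IDF values
# learned from the training data (the results below came from a run that called fit_transform here)
test_data_tfidf = tfidf_trans.transform(test_data_tf)
predicted = clf.predict(test_data_tfidf)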

from sklearn import metrics
from sklearn.metrics import accuracy_score
print("Accuracy of the model:", accuracy_score(sinhala_test.target, predicted))
print(metrics.classification_report(sinhala_test.target, predicted, target_names=sinhala_test.target_names))
metrics.confusion_matrix(sinhala_test.target, predicted)

And this is my output:

Accuracy of the model: 0.864
               precision    recall  f1-score   support

     business       0.78      0.94      0.85       100
entertainment       0.95      0.86      0.90       100
        local       0.89      0.65      0.75       100
       sports       0.91      0.93      0.92       100
        world       0.83      0.94      0.88       100

    micro avg       0.86      0.86      0.86       500
    macro avg       0.87      0.86      0.86       500
 weighted avg       0.87      0.86      0.86       500

array([[94,  2,  4,  0,  0],
       [ 2, 86,  2,  4,  6],
       [19,  0, 65,  5, 11],
       [ 1,  3,  1, 93,  2],
       [ 5,  0,  1,  0, 94]], dtype=int64)
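
The report and the confusion matrix above describe the same predictions, so one can be checked against the other. The following is a minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. It recomputes the accuracy and the per-class precision and recall from the matrix alone. The variable names are illustrative and not part of the original code.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true classes, columns are predicted classes.
labels = ["business", "entertainment", "local", "sports", "world"]
cm = [
    [94, 2, 4, 0, 0],
    [2, 86, 2, 4, 6],
    [19, 0, 65, 5, 11],
    [1, 3, 1, 93, 2],
    [5, 0, 1, 0, 94],
]

n = len(labels)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))  # diagonal = correct predictions
accuracy = correct / total

# Recall: correct predictions divided by the row total (all true members of the class).
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(n)}
# Precision: correct predictions divided by the column total (everything predicted as the class).
precision = {
    labels[j]: cm[j][j] / sum(cm[i][j] for i in range(n)) for j in range(n)
}

print(f"accuracy: {accuracy:.3f}")
for name in labels:
    print(f"{name:>13}  precision {precision[name]:.2f}  recall {recall[name]:.2f}")
```

Read this way, the low recall for local comes mostly from the third row: 19 of the 100 local documents were predicted as business and 11 as world.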

Now I want to predict the topic of the text file stored in new_file.

Can someone help me write the code to predict the topic of this text file?

I solved my problem. This is the code I used to predict the topic:

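# new_file holds the raw text of the file to classify (read in the question's code)
docs_new1 = new_file
# transform expects a list of documents, so wrap the single text in a list
docs_new = [docs_new1]
# Reuse the fitted vectorizer and TF-IDF transformer; do not refit them
X_new_counts = count_vectorization.transform(docs_new)
X_new_tfidf = tfidf_trans.transform(X_new_counts)

predicted_topic = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted_topic):
    # Map the numeric class index back to its category name
    topic = sinhala_train.target_names[category]
    print(topic)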
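
An alternative to calling each fitted step by hand is to chain the vectorizer, the TF-IDF transformer and the classifier in a scikit-learn pipeline, so a new document goes through exactly the same steps as the training data. The following is only a sketch under stated assumptions: the tiny English training set stands in for sinhala_train, and predict_topic, train_texts, train_targets and the temporary file are hypothetical names introduced for illustration, not part of the original code.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for sinhala_train (assumption: the real data comes from load_files).
train_texts = [
    "stock market shares profit",
    "bank profit shares rise",
    "football match goal win",
    "cricket match team win",
]
train_targets = [0, 0, 1, 1]
target_names = ["business", "sports"]

# One object that counts words, applies TF-IDF weighting and classifies.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(train_texts, train_targets)


def predict_topic(path, model, target_names):
    """Read one UTF-8 text file and return the name of its predicted topic."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # predict expects an iterable of documents, so wrap the single text in a list.
    category = model.predict([text])[0]
    return target_names[category]


# Write a throwaway file so the sketch runs without the original data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the team scored a goal in the match")
    tmp_path = tmp.name

topic = predict_topic(tmp_path, model, target_names)
os.remove(tmp_path)
print(topic)
```

With the real data, the same function would be called with the path to adaderana_67571.txt and sinhala_train.target_names, after fitting the pipeline on sinhala_train.data and sinhala_train.target.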
