
Classification of text using Naive Bayes in Python

I have created a model in which I am running Naive Bayes to get the expected output.

from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
    ('Agree Completely Agree Strongly Agree Somewhat Disagree Somewhat Disagree Strongly Completely Disagree','TRUE'),
    ('Concerned 2 3 4 5 6 7 - Comfortable','TRUE'),
    ('1 - disagree strongly 2 - disagree somewhat 3 - neither agree nor disagree 4 - agree somewhat 5 - agree strongly','TRUE'),
    ('1 - doesn\'t apply at all 2 3 4 5 6 7 - applies completely','TRUE'),
    ('1 - extremely new and different 2 3 4 5 6 7 - not at all new & different','TRUE'),
    ('1 - extremely relevant 2 3 4 5 6 7 - not at all relevant','TRUE'),
    ('1 - I don\'t want brands to engage with me at all on social media 2 3 4 5 6 7 - I love to engage with brands on social media','TRUE'),
    ('1 - Most Important 2 3 4 5 - Least Important','TRUE'),
    ('pepsi','FALSE'),
    ('coca cola','FALSE'),
    ('hyundai','FALSE'),
    ('Audio quality','FALSE'),
    ('Product features ','FALSE'),
    ('Content ','FALSE')
]
test_corpus = [
    ('1 - Agree Completely 2 - Agree Strongly 3 - Agree Somewhat 4 - Disagree Somewhat 5 - Disagree Strongly 6 - Completely Disagree','TRUE'),
    ('1 - Concerned 2 3 4 5 6 7 - Comfortable','TRUE'),
    ('Content ','FALSE'),
    ('Ease of navigation','FALSE')
]
model = NBC(training_corpus) 
print(model.classify('pepsi'))
print(model.accuracy(test_corpus)*100)

When I run this code, it reports 100% accuracy but returns FALSE each and every time. I am not sure what is wrong, but that is not the expected output.

Your model is fine; the behaviour comes from your training data and the classifier's default features.
With the training data you provided, it actually works well. Let's test it a bit:

def test(s):
    # Print the class probabilities the model assigns to the string s.
    prob_dist = model.prob_classify(s)
    print("classifying", s)
    print("probability of being FALSE:", round(prob_dist.prob("FALSE"), 2),
          "probability of being TRUE:", round(prob_dist.prob("TRUE"), 2))
    print('-' * 70)

test_cases = ['1', '1 - ', '2', '2 3 4 5', '1- 2 3 4 5', 'pepsi', 'coca', 'BMW']
for tc in test_cases:
    test(tc)

Here is the output, which looks quite reasonable:

classifying 1
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying 1 - 
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying 2
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying 2 3 4 5
probability of being FALSE: 0.05 probability of being TRUE: 0.95
----------------------------------------------------------------------
classifying 1- 2 3 4 5
probability of being FALSE: 0.0 probability of being TRUE: 1.0
----------------------------------------------------------------------
classifying pepsi
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying coca
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------
classifying BMW
probability of being FALSE: 1.0 probability of being TRUE: 0.0
----------------------------------------------------------------------

So why does the classifier behave like this? Look at your code: where do you specify the feature vector? Nowhere, so the classifier falls back on its default function for extracting feature vectors, as explained in the TextBlob documentation (you can also take a look at the source code).
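To make that default behaviour concrete, here is a minimal sketch of what a bag-of-words "contains(word)" extractor does. It is a simplified stand-in, not TextBlob's actual source: the names build_vocabulary and contains_features are made up for illustration, the two-example corpus is hypothetical, and the sketch splits on whitespace where TextBlob uses a proper tokenizer.

```python
def build_vocabulary(train_set):
    """Collect every distinct token seen in the training texts."""
    vocab = set()
    for text, _label in train_set:
        vocab.update(text.split())
    return vocab


def contains_features(document, vocab):
    """Map each vocabulary word to True/False: does the document contain it?"""
    tokens = set(document.split())
    return {"contains({})".format(word): word in tokens for word in vocab}


# Tiny hypothetical corpus in the same spirit as the question's data.
toy_corpus = [
    ("1 - Most Important 2 3 4 5 - Least Important", "TRUE"),
    ("pepsi", "FALSE"),
]
vocab = build_vocabulary(toy_corpus)
features = contains_features("coca cola", vocab)

# "coca cola" shares no token with the vocabulary, so every feature is False.
print(any(features.values()))                          # False
print(contains_features("2 3", vocab)["contains(2)"])  # True
```

With features like these, an unseen or very short input is described almost entirely by what it does not contain.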

For example, you can inspect your model's most informative features like this:

model.show_informative_features()


>>> Most Informative Features
             contains(4) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(3) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(5) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(2) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(1) = False           FALSE : TRUE   =      3.3 : 1.0
             contains(7) = False           FALSE : TRUE   =      2.4 : 1.0
             contains(6) = False           FALSE : TRUE   =      2.4 : 1.0
            contains(at) = False           FALSE : TRUE   =      1.9 : 1.0
           contains(all) = False           FALSE : TRUE   =      1.9 : 1.0
           contains(not) = False           FALSE : TRUE   =      1.3 : 1.0
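These ratios can be reproduced by hand. The sketch below assumes the underlying NLTK classifier smooths its counts with expected-likelihood estimation (adding 0.5 to each count); that is an assumption here, but it matches the printed numbers. The counts come from the training corpus in the question (6 FALSE examples, 8 TRUE examples), and the helper names are hypothetical.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate of P(feature value | label)."""
    return (count + gamma) / (total + bins * gamma)


def false_to_true_ratio(true_docs_without_word, n_true=8, n_false=6):
    """Ratio P(contains(w)=False | FALSE) : P(contains(w)=False | TRUE).

    None of the 6 FALSE examples contains a digit, so all of them have
    contains(w) = False for the digit features.
    """
    p_given_false = smoothed_prob(n_false, n_false)
    p_given_true = smoothed_prob(true_docs_without_word, n_true)
    return p_given_false / p_given_true


# Argument: number of TRUE training examples that do NOT contain the token.
print(round(false_to_true_ratio(1), 1))  # tokens 2, 3, 4, 5 -> 5.6
print(round(false_to_true_ratio(2), 1))  # token 1          -> 3.3
print(round(false_to_true_ratio(3), 1))  # tokens 6, 7      -> 2.4
```

In other words, the absence of each digit is evidence for FALSE, which is why a lone '1' or '2' is still classified as FALSE while '2 3 4 5' tips the balance towards TRUE.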
