
Naive Bayes classifier using sklearn.naive_bayes.BernoulliNB: how do I use the model to predict?

I have a file with a training data set like this:

sentence           F1 F2 F3 F4 F5 class
this is a dog      0  1  0  0  0  1   
i like cats        1  0  0  0  0  1 
go to the fridge   0  0  1  0  0  0
i drive a car      0  0  0  1  0  0
i dislike rabbits  0  0  0  0  1  1

I have a set of sentences, and I want to predict whether each sentence mentions an animal or not (the class); in this example the sentences are short, but in real life they are longer. I've assigned features to each sentence: F1 = is "cat" mentioned in the sentence, F2 = is "dog" mentioned, F3 = is "fridge" mentioned, F4 = is "car" mentioned, F5 = is "rabbit" mentioned, and the class is whether or not there is an animal in the sentence.

So then I have another file with a list of sentences (test data set):

dolphins live in the sea
bears live in the woods
there is no milk left
where are the zebras

I want to train a Naive Bayes Classifier using the training data set (the matrix of features above), and then use the model that's made on the test file of sentences. Can I do this?

I tried this:

import sys
import numpy as np
from sklearn.naive_bayes import BernoulliNB

sentence = []
feature1 = []
feature2 = []
feature3 = []
feature4 = []
feature5 = []
class_name = []

test_dataset = [line.strip() for line in open(sys.argv[2])]

for line in open(sys.argv[1]):
    line = line.strip().split('\t')
    sentence.append(line[0])
    feature1.append(line[1])
    feature2.append(line[2])
    feature3.append(line[3])
    feature4.append(line[4])
    feature5.append(line[5])
    class_name.append(line[6])

list_of_features = [feature1,feature2,feature3,feature4,feature5]

#I'm not sure if this is right: question 1 below
clf = BernoulliNB()
clf.fit(list_of_features,class_name)

# I'm not sure what goes in the next line: question 2 below
print clf.predict(??)

I have some questions.

  1. When I run the code as far as the clf.fit line, I get this error:

    File "naive_bayes.py", line 28, in <module>
        clf.fit(list_of_features,class_name)
    File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 527, in fit
        X, y = check_X_y(X, y, 'csr')
    File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 520, in check_X_y
        check_consistent_length(X, y)
    File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 176, in check_consistent_length
        "%s" % str(uniques))
    ValueError: Found arrays with inconsistent numbers of samples: [ 5 10]

But when I count the lengths of my lists, they all seem to be the same. Could anyone shed light on what I'm doing wrong here?

  2. My second question: is it correct for the line 'print clf.predict()' to read 'print clf.predict(test_dataset)' (i.e. a list of sentences, with no features or classes attached, that I want assigned to class 0 or 1)? I can't test this at the minute, as I can't get past the error from question 1.

  3. As a side note, once I eventually get this working, it would be great to somehow work out the accuracy of the predictor. However, I'm struggling to get the basics working first.

Edit 1: Revised Script

import numpy as np
from sklearn.naive_bayes import BernoulliNB
import sys

sentence = []
feature1 = []
feature2 = []
feature3 = []
feature4 = []
feature5 = []
class_name = []

for line in open(sys.argv[1]):
    line = line.strip().split('\t')
    sentence.append(line[0])
    feature1.append(int(line[1]))
    feature2.append(int(line[2]))
    feature3.append(int(line[3]))
    feature4.append(int(line[4]))
    feature5.append(int(line[5]))
    class_name.append(int(line[6]))

print feature1
print feature2
print feature3
print feature4
print feature5
print class_name


list_of_features = [feature1,feature2,feature3,feature4,feature5]
transpos_list_of_features = np.array(list_of_features).T
clf = BernoulliNB()
print clf.fit(transpos_list_of_features,class_name)
#print clf.predict(??)

The output:

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

1 - There are a couple of issues here.

  • First, make sure you parse the file correctly. If your train file looks exactly like the lines above, then you probably want to skip the first line: it contains the header and should not be in your X or y matrices. Also make sure the feature and class_name variables contain what you expect; you can check them by printing.
  • I guess the feature and class lists are getting the strings '0' or '1' instead of integer values. I don't think this scikit-learn estimator can work with string values, so you should cast them to integers. It could be something like feature1.append(int(line[1])).
  • The list_of_features variable is an n_features x n_samples matrix, but its shape should be n_samples x n_features. You can transpose it with list_of_features = np.array(list_of_features).T
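The shape mismatch in the traceback can be reproduced, and fixed, with the lists your revised script prints; a minimal sketch using those same values:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# The five feature lists and labels printed by the revised script (10 samples)
feature1 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
feature2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
feature3 = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
feature4 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
feature5 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
class_name = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

list_of_features = np.array([feature1, feature2, feature3, feature4, feature5])
print(list_of_features.shape)  # (5, 10): sklearn sees 5 "samples" but 10 labels

# Transpose so rows are samples and columns are features
X = list_of_features.T
print(X.shape)                 # (10, 5): 10 samples matching the 10 labels

clf = BernoulliNB()
clf.fit(X, class_name)         # fits without the inconsistent-samples error
```

Passing the untransposed array is exactly what produces "inconsistent numbers of samples: [ 5 10]": check_X_y compares the 5 rows of X with the 10 entries of y.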

2 - The classifier has no idea how to map sentences to features, so you have to provide the features explicitly. You can do this by traversing each sentence and checking whether the target words occur in it.

Edit:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

feature_word_list = ["cat", "dog", "fridge", "car", "rabbit"]
feature1 = [0, 1, 0, 0, 0]
feature2 = [1, 0, 0, 0, 0]
feature3 = [0, 0, 1, 0, 0]
feature4 = [0, 0, 0, 1, 0]
feature5 = [1, 1, 0, 0, 1]
class_name_list = [1, 1, 0, 0, 1]

train_features = np.array([feature1,feature2,feature3,feature4,feature5]).T

clf = BernoulliNB()
clf.fit(train_features, class_name_list)

The code above is the same as yours, except that I put the feature values in directly instead of reading them from a file.

test_data = ["this is about dog and cats","not animal related sentence"]
test_feature_list = []
for test_instance in test_data:
  test_feature = [1 if feature_word in test_instance else 0 for feature_word in feature_word_list]
  test_feature_list.append(test_feature) 

test_feature_matrix = np.array(test_feature_list)
print(test_feature_matrix)

Now your test_feature_matrix would look like this:

[[1 1 0 0 0]
 [0 0 0 0 0]]

Note that I have 2 test sentences, so the matrix has 2 corresponding rows, and each column represents a feature value (i.e. whether a particular word exists in the sentence or not). That is what I tried to say in point 2: the classifier does not know about "cat", "fridge" or anything else; what it needs is whether each word exists or not, 1 or 0.
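As a side note, scikit-learn can build this binary word-presence matrix for you with CountVectorizer(binary=True) and a fixed vocabulary. One difference from the substring loop above: CountVectorizer matches whole tokens, so "cats" does not count as "cat". A sketch with the same test sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

feature_word_list = ["cat", "dog", "fridge", "car", "rabbit"]
test_data = ["this is about dog and cats", "not animal related sentence"]

# binary=True yields 1/0 presence instead of counts; the vocabulary list
# fixes the columns to our five feature words, in that order
vectorizer = CountVectorizer(binary=True, vocabulary=feature_word_list)
test_feature_matrix = vectorizer.transform(test_data).toarray()
print(test_feature_matrix)
# Token-based matching: "cats" != "cat", so the first row is [0 1 0 0 0]
# rather than the [1 1 0 0 0] produced by the substring check
```

Whether substring or token matching is right depends on your data; for real text you would usually combine this with stemming or lemmatisation so "cats" maps to "cat".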

Now you can predict the labels of these test sentences:

predicted_label_list = clf.predict(test_feature_matrix)
print(predicted_label_list)

which gives the result of

[1 0]

Note: it may not work well with your actual test data, since they contain words that are not in your feature space or training data. For example, your test data contain "zebra", but there is no "zebra" in the training set, so that sentence may be classified as 0.
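On your third question (accuracy): once you have true labels for your test sentences, sklearn.metrics.accuracy_score compares them with the predictions. The labels below are hypothetical, purely to show the call:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels for the two test sentences above
true_labels = [1, 0]
predicted_labels = [1, 0]  # e.g. the output of clf.predict(test_feature_matrix)

# Fraction of predictions that match the true labels
print(accuracy_score(true_labels, predicted_labels))  # 1.0 when all match
```

With only two test sentences this is not very meaningful; with a larger labelled set you would typically hold out a portion of it (e.g. with sklearn.model_selection.train_test_split) and score on that.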
