
Naive Bayes Classifier using Sklearn.naive_bayes.Bernoulli; how to use model to predict?

I have a file with a training data set like this:

sentence           F1 F2 F3 F4 F5 class
this is a dog      0  1  0  0  0  1   
i like cats        1  0  0  0  0  1 
go to the fridge   0  0  1  0  0  0
i drive a car      0  0  0  1  0  0
i dislike rabbits  0  0  0  0  1  1

I have a set of sentences. I want to predict whether each sentence mentions an animal or not (the class); the sentences in this example are short, but in real life they are longer. I've assigned features to each sentence: F1 = is "cat" mentioned in the sentence, F2 = is "dog" mentioned, F3 = is "fridge" mentioned, F4 = is "car" mentioned, F5 = is "rabbit" mentioned, and the class is whether or not an animal is mentioned in the sentence.

So then I have another file with a list of sentences (the test data set):

dolphins live in the sea
bears live in the woods
there is no milk left
where are the zebras

I want to train a Naive Bayes classifier using the training data set (the matrix of features above), and then use the resulting model on the test file of sentences. Can I do this?

I tried this:

import numpy as np
import sys
from sklearn.naive_bayes import BernoulliNB

sentence = []
feature1 = []
feature2 = []
feature3 = []
feature4 = []
feature5 = []
class_name = []

test_dataset = [line.strip() for line in open(sys.argv[2])]

for line in open(sys.argv[1]):
    line = line.strip().split('\t')
    sentence.append(line[0])
    feature1.append(line[1])
    feature2.append(line[2])
    feature3.append(line[3])
    feature4.append(line[4])
    feature5.append(line[5])
    class_name.append(line[6])

list_of_features = [feature1,feature2,feature3,feature4,feature5]

#I'm not sure if this is right: question 1 below
clf = BernoulliNB()
clf.fit(list_of_features,class_name)

# I'm not sure what goes in the next line: question 2 below
print clf.predict(??)

I have some questions.

  1. When I run the code up to the clf.fit line, I get this error:

    File "naive_bayes.py", line 28, in clf.fit(list_of_features,class_name) File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 527, in fit X, y = check_X_y(X, y, 'csr') File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 520, in check_X_y check_consistent_length(X, y) File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 176, in check_consistent_length "%s" % str(uniques)) ValueError: Found arrays with inconsistent numbers of samples: [ 5 10] 文件“ /usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py”的第57行在clf.fit中的文件“ naive_bayes.py”,第28行,在拟合X中, y = check_X_y(X,y,'csr')文件“ /usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py”,行520,在check_X_y中check_consistent_length(X,y)文件“ /usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py”,行176,在check_consistent_length“%s”%str(uniques)中)ValueError:找到样本数量不一致的数组: [5 10]

But when I count the lengths of my lists, they all seem to be the same length. Could anyone shed light on what I'm doing wrong here?

  2. My second question: is it correct for the line print clf.predict() to read print clf.predict(test_dataset) (i.e. a list of sentences, with no features or classes attached, that I want assigned to class 0 or 1)? I can't test this at the minute, as I can't seem to get past my error from question 1.

  3. As a side note, once I can eventually get this to work, it would be great to somehow work out the accuracy of the predictor (a minimal sketch of this is shown right after this list). However, I'm struggling to get the basics working first.
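A sketch of how the accuracy could eventually be measured with sklearn.metrics.accuracy_score, assuming the true class of each test sentence is known; the labels below are made up purely to show the call:

from sklearn.metrics import accuracy_score

# hypothetical labels for four test sentences: model predictions vs. true classes
predicted_labels = [1, 1, 0, 0]
true_labels = [1, 1, 0, 1]

print(accuracy_score(true_labels, predicted_labels))  # prints 0.75

accuracy_score simply returns the fraction of predictions that match the true labels.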

Edit 1: Revised Script

import numpy as np
from sklearn.naive_bayes import BernoulliNB
import sys

sentence = []
feature1 = []
feature2 = []
feature3 = []
feature4 = []
feature5 = []
class_name = []

for line in open(sys.argv[1]):
    line = line.strip().split('\t')
    sentence.append(line[0])
    feature1.append(int(line[1]))
    feature2.append(int(line[2]))
    feature3.append(int(line[3]))
    feature4.append(int(line[4]))
    feature5.append(int(line[5]))
    class_name.append(int(line[6]))

print feature1
print feature2
print feature3
print feature4
print feature5
print class_name


list_of_features = [feature1,feature2,feature3,feature4,feature5]
transpos_list_of_features = np.array(list_of_features).T
clf = BernoulliNB()
print clf.fit(transpos_list_of_features,class_name)
#print clf.predict(??)

The output:

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

1 - There are a couple of issues here.

  • First, make sure you parse the file correctly. If your training file looks exactly like the lines above, you probably want to skip the first line: it contains the header and should not end up in your X or y matrices. Make sure the feature and class_name variables contain what you expect; you can check them by printing.
  • Regarding sentence.append(line[0]) and the lines that follow it: I guess you are getting the strings '0' and '1' instead of integer values, and I don't think this scikit-learn module can work with string values. You should cast the feature and class columns to integers, e.g. feature1.append(int(line[1])).
  • The list_of_features variable is an n_features x n_samples matrix, but its shape should be n_samples x n_features. You can transpose it with list_of_features = np.array(list_of_features).T (see the parsing sketch after this list).
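A minimal sketch of the corrected parsing, assuming the training file is tab-separated and laid out as in the question (a header line, then the sentence in column 0, five 0/1 features, and the class); building one row per sentence up front also avoids the transpose:

import sys
import numpy as np
from sklearn.naive_bayes import BernoulliNB

features = []   # one row of 0/1 values per training sentence
labels = []     # the class of each training sentence

with open(sys.argv[1]) as f:
    next(f)  # skip the header line (sentence F1 F2 F3 F4 F5 class)
    for line in f:
        cols = line.strip().split('\t')
        features.append([int(v) for v in cols[1:6]])  # F1..F5 as integers
        labels.append(int(cols[6]))                   # class as integer

X = np.array(features)   # shape: (n_samples, n_features)
y = np.array(labels)

clf = BernoulliNB()
clf.fit(X, y)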

2 - The classifier has no idea how to map sentences to features, so you have to provide the features explicitly. You can do this by traversing each sentence and checking whether the target words occur in it.

Edit:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

feature_word_list = ["cat", "dog", "fridge", "car", "rabbit"]
feature1 = [0, 1, 0, 0, 0]
feature2 = [1, 0, 0, 0, 0]
feature3 = [0, 0, 1, 0, 0]
feature4 = [0, 0, 0, 1, 0]
feature5 = [1, 1, 0, 0, 1]
class_name_list = [1, 1, 0, 0, 1]

train_features = np.array([feature1,feature2,feature3,feature4,feature5]).T  # transpose to n_samples x n_features

clf = BernoulliNB()
clf.fit(train_features, class_name_list)

The code above is the same, except that I put the feature values in directly instead of reading them from a file.

test_data = ["this is about dog and cats","not animal related sentence"]
test_feature_list = []
for test_instance in test_data:
  test_feature = [1 if feature_word in test_instance else 0 for feature_word in feature_word_list]
  test_feature_list.append(test_feature) 

test_feature_matrix = np.array(test_feature_list)
print(test_feature_matrix)

Now your test_feature_matrix looks like this:

[[1 1 0 0 0]
 [0 0 0 0 0]]

Note that I have 2 test sentences, so the matrix has 2 corresponding rows, and each column represents a feature value (i.e. whether a particular word exists in the sentence or not). That is what I was trying to say in point 2: the classifier does not know about "cat", "fridge" or anything else; all it needs is whether each word exists or not, 1 or 0.

Now you can predict the labels of these test data (sentences):

predicted_label_list = clf.predict(test_feature_matrix)
print(predicted_label_list)

which gives the result of

[1 0]

Note: it may not work with your test data, since that data contains words that are not in your feature space or training data. What I mean is that your test data contains "zebras", but there is no "zebra" in the training set, so those sentences may be classified as 0.
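As an aside beyond the answer above: if you eventually want 0/1 word-presence features over the whole training vocabulary rather than five hand-picked words, one option is scikit-learn's CountVectorizer with binary=True, which builds the same kind of matrix automatically; a minimal sketch using the training sentences from the question:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

train_sentences = ["this is a dog", "i like cats", "go to the fridge",
                   "i drive a car", "i dislike rabbits"]
train_labels = [1, 1, 0, 0, 1]

# binary=True records word presence (0/1) instead of word counts
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_sentences)

clf = BernoulliNB()
clf.fit(X_train, train_labels)

test_sentences = ["dolphins live in the sea", "where are the zebras"]
X_test = vectorizer.transform(test_sentences)  # unseen words are simply dropped
print(clf.predict(X_test))

Words like "zebras" that never appear in the training sentences still contribute nothing, so the caveat above about out-of-vocabulary words applies here as well.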
