How to interpret the "most informative features" in the nltk package
I'm new to NLP and am struggling to interpret the results I get from a simple example of NLP classification by most informative features. Specifically, in the common example shown below, I don't understand why the word "this" is informative when it appears in 3/5 negative-sentiment sentences and 3/5 positive sentences:
train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg')]
from nltk.tokenize import word_tokenize # or use some other tokenizer
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
import nltk
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()
Here are the results:
Most Informative Features
this = True neg : pos = 2.3 : 1.0
this = False pos : neg = 1.8 : 1.0
an = False neg : pos = 1.6 : 1.0
. = False neg : pos = 1.4 : 1.0
. = True pos : neg = 1.4 : 1.0
feel = False neg : pos = 1.2 : 1.0
of = False pos : neg = 1.2 : 1.0
not = False pos : neg = 1.2 : 1.0
do = False pos : neg = 1.2 : 1.0
very = False neg : pos = 1.2 : 1.0
Any ideas? I'd love an explanation of the formula that calculates a word's probability / informativeness.
I also did this super simple example:
train = [('love', 'pos'),
         ('love', 'pos'),
         ('love', 'pos'),
         ('bad', 'pos'),
         ("bad", 'pos'),
         ('bad', 'neg'),
         ('bad', 'neg'),
         ("bad", 'neg'),
         ('bad', 'neg'),
         ('love', 'neg')]
And got the following:
Most Informative Features
bad = False pos : neg = 2.3 : 1.0
love = True pos : neg = 2.3 : 1.0
love = False neg : pos = 1.8 : 1.0
bad = True neg : pos = 1.8 : 1.0
Which, while directionally right, doesn't seem to match up with any likelihood-ratio calculation I can figure out.
Looking at the source of show_most_informative_features(), from nltk's documentation:
Informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label.
But, in your case, there simply aren't enough data points to estimate these probabilities reliably, i.e. the probability distribution is more or less flat, which you can see from the raw weight values of the features. That is probably why irrelevant features are marked as most important. If you experiment by adding just 3-4 more sentences, you will notice how this changes.
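As for reproducing the exact ratios: NLTK's NaiveBayesClassifier.train defaults to expected-likelihood estimation (ELE), which adds 0.5 to every count before normalizing. Under that assumption, the ratios in your second toy example can be recomputed by hand (a minimal sketch in plain Python, no nltk required):

```python
# Reproduce the "love"/"bad" ratios by hand, assuming NLTK's default
# expected-likelihood estimation (ELE): add 0.5 to each count and
# divide by (N + 0.5 * bins), where bins = 2 (feature is True or False).
def ele(count, total, bins=2):
    return (count + 0.5) / (total + 0.5 * bins)

# Counts from the 10-item toy set: 5 pos docs, 5 neg docs.
# 'love' is True in 3 pos docs and 1 neg doc.
p_love_true_pos = ele(3, 5)   # P(love=True | pos) = 3.5 / 6
p_love_true_neg = ele(1, 5)   # P(love=True | neg) = 1.5 / 6

# Informativeness = highest P(fname=fval|label) / lowest P(fname=fval|label).
print(round(p_love_true_pos / p_love_true_neg, 1))
# 2.3 -> matches "love = True    pos : neg = 2.3 : 1.0"

# 'love' is False in 2 pos docs and 4 neg docs.
print(round(ele(4, 5) / ele(2, 5), 1))
# 1.8 -> matches "love = False   neg : pos = 1.8 : 1.0"
```

Note how the add-0.5 smoothing dominates when there are only five documents per label, which is exactly why the ratios stay close to 1:1 here and why adding a handful of sentences changes the ranking so visibly.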