
How to interpret the "most informative features" in the nltk package

I'm new to NLP and am struggling to interpret the results I get from a simple classification example when looking at the most informative features. Specifically, in the common example I've shown below, I don't understand why the word "this" is informative when it appears in 3/5 negative-sentiment sentences and 3/5 positive sentences.

train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg')]

from nltk.tokenize import word_tokenize # or use some other tokenizer
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

import nltk
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()

Here are the results:

Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                     not = False             pos : neg    =      1.2 : 1.0
                      do = False             pos : neg    =      1.2 : 1.0
                    very = False             neg : pos    =      1.2 : 1.0

Any ideas? I'd love an explanation of the formula that calculates a word's probability / its informativeness.

I also did this super simple example:

train = [('love', 'pos'),
         ('love', 'pos'),
         ('love', 'pos'),
         ('bad', 'pos'),
         ("bad", 'pos'),
         ('bad', 'neg'),
         ('bad', 'neg'),
         ("bad", 'neg'),
         ('bad', 'neg'),
         ('love', 'neg')]

And I get the following:


Most Informative Features
                     bad = False             pos : neg    =      2.3 : 1.0
                    love = True              pos : neg    =      2.3 : 1.0
                    love = False             neg : pos    =      1.8 : 1.0
                     bad = True              neg : pos    =      1.8 : 1.0

While this is directionally right, it doesn't seem to match up with any likelihood-ratio calculation I can figure out.

Looking at the source of show_most_informative_features(), from nltk's documentation:

Informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label.
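For concreteness, the counts are not used raw: if I'm reading the nltk source right, NaiveBayesClassifier.train smooths every count with an expected-likelihood estimator (add 0.5 to each count) by default, and that smoothing is what turns your tiny counts into ratios like 2.3 : 1. Here is a minimal sketch under that add-0.5 assumption, reproducing the numbers from your second example:

# Sketch of the informativeness ratio, assuming nltk's default
# expected-likelihood (add-0.5) smoothing over the two feature values True/False.
def smoothed_prob(count, total, gamma=0.5, bins=2):
    # P(fname=fval | label) with add-gamma smoothing over `bins` possible feature values
    return (count + gamma) / (total + gamma * bins)

# 'love' = True in 3 of 5 'pos' items and 1 of 5 'neg' items
p_pos = smoothed_prob(3, 5)     # (3 + 0.5) / (5 + 1) ~ 0.583
p_neg = smoothed_prob(1, 5)     # (1 + 0.5) / (5 + 1) = 0.250
print(round(p_pos / p_neg, 1))  # 2.3 -> "love = True   pos : neg = 2.3 : 1.0"

# 'bad' = True in 4 of 5 'neg' items and 2 of 5 'pos' items
print(round(smoothed_prob(4, 5) / smoothed_prob(2, 5), 1))  # 1.8 -> "bad = True  neg : pos = 1.8 : 1.0"

The same arithmetic explains your first example too: your feature extractor lowercases words when building all_words but tests them against the original-case tokens, so 'this' = True only fires on sentences containing the lowercase token "this" (3 neg, 1 pos), and 3.5 / 1.5 gives roughly 2.3 again.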

But in your case there are simply not enough data points to estimate these probabilities reliably, i.e. the probability distribution is more or less flat, which you can see from the raw weight values of the features. That is probably why irrelevant features get marked as most informative. If you experiment by adding just 3-4 more sentences, you will notice how this changes.
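If you want to look at those raw values yourself, the trained classifier keeps the smoothed per-label distributions in _feature_probdist; that is an internal attribute, so treat the following as a quick debugging sketch rather than a stable API, run against the classifier trained above:

# Print the smoothed P(feature=value | label) estimates behind the ratios.
# _feature_probdist is internal to nltk's NaiveBayesClassifier and is keyed by
# (label, feature_name), so this may change between nltk versions.
for label in classifier.labels():
    dist = classifier._feature_probdist[label, 'this']
    print(label, round(dist.prob(True), 3), round(dist.prob(False), 3))

With only five sentences per label, all of these values sit close together, which is the flatness described above.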

