
How to interpret the "most informative features" in the nltk package

I am new to NLP, and I am having trouble interpreting the results of a simple NLP classification example that reports the most informative features. Specifically, in the common example shown below, I don't understand why the word "this" is informative when it appears in 3 of the 5 negative-sentiment sentences and in 3 of the 5 positive ones.

train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ('What an awesome view', 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg')]

from nltk.tokenize import word_tokenize  # or use some other tokenizer
# Lowercased vocabulary over all training sentences.
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
# One boolean feature per vocabulary word: is it present in the sentence?
# (The sentence tokens are not lowercased here, so the check is case-sensitive.)
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

import nltk
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()

The results are as follows:

Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                     not = False             pos : neg    =      1.2 : 1.0
                      do = False             pos : neg    =      1.2 : 1.0
                    very = False             neg : pos    =      1.2 : 1.0

Any ideas? I would like an explanation of the formula used to compute a word's probability/informativeness.

I also tried this super-simple example:

train = [('love', 'pos'),
         ('love', 'pos'),
         ('love', 'pos'),
         ('bad', 'pos'),
         ('bad', 'pos'),
         ('bad', 'neg'),
         ('bad', 'neg'),
         ('bad', 'neg'),
         ('bad', 'neg'),
         ('love', 'neg')]

and got the following:


Most Informative Features
                     bad = False             pos : neg    =      2.3 : 1.0
                    love = True              pos : neg    =      2.3 : 1.0
                    love = False             neg : pos    =      1.8 : 1.0
                     bad = True              neg : pos    =      1.8 : 1.0

While I can figure out which direction is right, the ratios don't seem to match any likelihood-ratio calculation I can work out.

From the docstring of show_most_informative_features() in the nltk source:

The informativeness of a feature (fname, fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label.
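To see where ratios like 2.3 : 1.0 come from, here is a minimal hand calculation for your second ('love'/'bad') example. It assumes the default estimator that NaiveBayesClassifier.train uses, ELEProbDist (expected likelihood estimation), which adds 0.5 to every count and 0.5 per possible feature value (here True/False) to the denominator:

# 'love' = True occurs with pos 3 times and with neg once;
# there are 5 pos and 5 neg examples, and the feature has 2 values,
# so each conditional probability is (count + 0.5) / (5 + 0.5 * 2).
p_love_pos = (3 + 0.5) / (5 + 0.5 * 2)   # P(love=True | pos) ~= 0.583
p_love_neg = (1 + 0.5) / (5 + 0.5 * 2)   # P(love=True | neg)  = 0.250
print(p_love_pos / p_love_neg)           # ~= 2.33 -> "love = True   pos : neg = 2.3 : 1.0"

# The False side uses the complementary counts (2 pos, 4 neg):
p_no_love_pos = (2 + 0.5) / 6            # ~= 0.417
p_no_love_neg = (4 + 0.5) / 6            #  = 0.750
print(p_no_love_neg / p_no_love_pos)     # ~= 1.80 -> "love = False  neg : pos = 1.8 : 1.0"

The same arithmetic reproduces the 2.3 : 1.0 and 1.8 : 1.0 lines for 'bad', so the printed numbers are these smoothed likelihood ratios, not ratios of the raw counts.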

However, in your case there are simply not enough data points to compute these probabilities reliably; that is, the probability distributions are almost flat, as you can see from the raw weight values of the features. That is probably why irrelevant features get flagged as the most informative. You will notice this change as soon as you add just 3-4 more sentences to the experiment.
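If you want to look at those near-flat distributions yourself, you can read them off the trained classifier. The sketch below relies on the classifier's internal _feature_probdist attribute; that is an implementation detail of NLTK's NaiveBayesClassifier, so treat it as a debugging aid rather than a stable API:

# Print P(fname=fval | label) for every feature the classifier learned;
# _feature_probdist maps (label, feature_name) -> a smoothed ProbDist.
for (label, fname), probdist in sorted(classifier._feature_probdist.items()):
    for fval in (True, False):
        print('P(%s=%s | %s) = %.3f' % (fname, fval, label, probdist.prob(fval)))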
