How to interpret the "most informative features" in the nltk package
I am new to NLP, and I am having trouble interpreting the results I get from a simple NLP-classification example of the "most informative features". Specifically, in the common example shown below, I don't understand why the word "this" is informative when it appears in 3 of the 5 negative-sentiment sentences and in 3 of the 5 positive ones.
train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
import nltk
from nltk.tokenize import word_tokenize  # or use some other tokenizer

all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()
The results are as follows:
Most Informative Features
this = True neg : pos = 2.3 : 1.0
this = False pos : neg = 1.8 : 1.0
an = False neg : pos = 1.6 : 1.0
. = False neg : pos = 1.4 : 1.0
. = True pos : neg = 1.4 : 1.0
feel = False neg : pos = 1.2 : 1.0
of = False pos : neg = 1.2 : 1.0
not = False pos : neg = 1.2 : 1.0
do = False pos : neg = 1.2 : 1.0
very = False neg : pos = 1.2 : 1.0
Any ideas? I would like an explanation of the formula used to compute a word's probability/informativeness.
I also tried this super-simple example:
train = [('love', 'pos'),
('love', 'pos'),
('love', 'pos'),
('bad', 'pos'),
("bad", 'pos'),
('bad', 'neg'),
('bad', 'neg'),
("bad", 'neg'),
('bad', 'neg'),
('love', 'neg')]
and got the following:
Most Informative Features
bad = False pos : neg = 2.3 : 1.0
love = True pos : neg = 2.3 : 1.0
love = False neg : pos = 1.8 : 1.0
bad = True neg : pos = 1.8 : 1.0
No matter which way I look at it, these ratios don't seem to match any likelihood-ratio calculation I can come up with.
From the documented source of show_most_informative_features() in nltk:

The informativeness of a feature (fname, fval) is equal to the highest value of P(fname = fval | label), for any label, divided by the lowest value of P(fname = fval | label), for any label.
However, in your case there simply aren't enough data points to estimate these probabilities well: the probability distributions are almost flat, as you can see from the raw ratio values of the features. That is probably why seemingly irrelevant features get flagged as most informative. If you add just 3-4 more sentences to the experiment, you will notice this change.
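Applying this formula to the second toy example reproduces NLTK's ratios exactly, provided you use the same smoothing: NaiveBayesClassifier.train defaults to the expected-likelihood estimate, which adds 0.5 to every count before normalizing. A minimal sketch (the ele helper and the hard-coded counts are mine, read off the ten-item training list above):

```python
def ele(count, total, bins=2):
    """Expected-likelihood estimate: add 0.5 to every count (NLTK's default)."""
    return (count + 0.5) / (total + 0.5 * bins)

# How many training examples of each label contain the word (feature = True),
# taken from the ten-item 'love'/'bad' training list above.
counts = {'love': {'pos': 3, 'neg': 1}, 'bad': {'pos': 2, 'neg': 4}}
n = {'pos': 5, 'neg': 5}  # five examples per label

ratios = {}
for word, c in counts.items():
    for present in (True, False):
        # P(word = present | label) for each label, with ELE smoothing
        probs = [ele(c[lab] if present else n[lab] - c[lab], n[lab])
                 for lab in ('pos', 'neg')]
        ratios[(word, present)] = max(probs) / min(probs)
        print(f"{word} = {present}: {ratios[(word, present)]:.1f} : 1.0")
```

For instance, P(bad = False | pos) = (3 + 0.5) / 6 ≈ 0.58 and P(bad = False | neg) = (1 + 0.5) / 6 = 0.25, so the ratio is 0.58 / 0.25 ≈ 2.3 : 1.0, matching the output. Without the add-0.5 smoothing, the raw maximum-likelihood ratios would indeed not match, which is likely the source of the confusion.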