
Ngram model and perplexity in NLTK

To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than on data preparation, I chose to use the Brown corpus from NLTK and to train the Ngram model shipped with NLTK as a baseline (to compare the other LMs against).

So my first question is actually about a behaviour of NLTK's Ngram model that I find suspicious. Since the code is rather short, I paste it here:

import nltk

print "... build"
brown = nltk.corpus.brown
corpus = [word.lower() for word in brown.words()]

# Train on 95% of the corpus and test on the rest
spl = 95 * len(corpus) / 100
train = corpus[:spl]
test = corpus[spl:]

# Replace rare words (seen fewer than 5 times in the training data)
# with a single *unknown* token
fdist = nltk.FreqDist(w for w in train)
vocabulary = set(w for w, count in fdist.iteritems() if count >= 5)

train = [w if w in vocabulary else "*unknown*" for w in train]
test = [w if w in vocabulary else "*unknown*" for w in test]

print "... train"
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

# Lidstone-smoothed 5-gram model
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(5, train, estimator=estimator)

print "len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % (
    len(corpus), len(vocabulary), len(train), len(test))
print "perplexity(test) =", lm.perplexity(test)

What I find very suspicious is that I get the following results:

... build
... train
len(corpus) = 1161192, len(vocabulary) = 13817, len(train) = 1103132, len(test) = 58060
perplexity(test) = 4.60298447026

With a perplexity of 4.6 it seems Ngram modeling is very good on that corpus. If my interpretation is correct, the model should be able to guess the correct word in roughly 5 tries on average (even though there are 13817 possibilities...). Could you share your experience on whether this perplexity value is plausible (I don't really believe it)? I did not find any complaints about the ngram model of NLTK on the net (but maybe I'm doing it wrong). Do you know any good alternatives to NLTK for Ngram models and computing perplexity?
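To put that number in perspective: perplexity is 2^H, where H is the cross-entropy in bits per word, so the reported value can be converted into bits. A minimal sanity-check sketch (the constant is just the perplexity printed above):

import math

# Perplexity reported by the run above
perplexity = 4.60298447026

# perplexity = 2**H, where H is the cross-entropy in bits per word, so
# H = log2(perplexity): 4.6 corresponds to only ~2.2 bits per word,
# whereas even a uniform model over the 13817-word vocabulary would
# give log2(13817) ~= 13.75 bits (perplexity 13817).
bits_per_word = math.log(perplexity, 2)
print "cross-entropy = %.2f bits/word" % bits_per_word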

Thanks!

You are getting a low perplexity because you are using a pentagram (5-gram) model. If you used a bigram model instead, your results would fall in a more regular range of about 50-1000 (or about 5 to 10 bits).
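For comparison, here is a minimal sketch of the same setup with a bigram model; it assumes the train and estimator variables from the question are still in scope (and the same pre-3.0 NLTK):

from nltk.model import NgramModel

# Bigram (n = 2) model over the same data, with the same Lidstone
# estimator; its test perplexity should land in the usual 50-1000
# range rather than the suspiciously low 4.6 of the 5-gram model.
bigram_lm = NgramModel(2, train, estimator=estimator)
print "bigram perplexity(test) =", bigram_lm.perplexity(test)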

Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:

https://github.com/nltk/nltk/issues?labels=model

As a matter of fact, the whole model module has been dropped from the NLTK-3.0a4 pre-release until the issues are fixed.
