
NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I am receiving the error:

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

I have already performed Latent Dirichlet Allocation on the data I have, and I have generated the unigrams and their respective probabilities (they are normalized, so the total probability over the data sums to 1).

My unigrams and their probabilities look like:

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

This is just a fragment of the unigrams file I have; the same format continues for thousands of lines. The probabilities in the second column sum to 1.

I am a budding programmer. This ngram.py belongs to the nltk package, and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation, and I don't know what to do now. Please help me with what I can do. Thanks in advance!

Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = ( Π_{i=1}^{N} 1/P(w_i) )^(1/N), where N is the number of words in the test set.

Now you say you have already constructed the unigram model, meaning that for each word you have its probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that gives the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I can adapt my solution accordingly.

perplexity = 1
N = 0

for word in testset:
    if word in unigram:                              # only words the model knows are counted
        N += 1
        perplexity = perplexity * (1/unigram[word])  # multiply the inverse probabilities
perplexity = pow(perplexity, 1/float(N))             # take the N-th root
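
If your unigram probabilities live in a plain two-column text file like the fragment you posted (word, then probability, separated by whitespace), a minimal sketch for loading it into such a dictionary could be the following; the filename unigrams.txt is just a placeholder for your own file:

unigram = {}
with open("unigrams.txt") as f:       # placeholder path to your unigram file
    for line in f:
        word, prob = line.split()     # two whitespace-separated columns
        unigram[word] = float(prob)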

UPDATE:

As you asked for a complete working example, here's a very simple one.

Suppose this is our corpus:

corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

Here's how we construct the unigram model first:

import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    # unseen words fall back to the defaultdict's 0.01 default probability
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        model[f] += 1                       # count occurrences of each token
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word] / N       # turn counts into relative frequencies
    return model

Our model here is smoothed: for words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:

#computes perplexity of the unigram model on a testset  
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N)) 
    return perplexity

Now we can test this on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)

for which you get the following result:

>>> 
49.09452736318415
99.99999999999997

Note that when dealing with perplexity, we try to reduce it: a language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is included in the unigram model, so the corresponding perplexity is smaller. In the second test set, none of the words are in the model, so each falls back to the 0.01 default, which gives a perplexity of (1/0.01^3)^(1/3) = 100, matching the second number above.
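
As a side note on the error in the question: the nltk.model.NgramModel class used by the old documentation example had long-standing issues and was removed in NLTK 3. In recent releases (roughly NLTK 3.4 and later) the language-modelling code lives in nltk.lm instead. A rough, untested sketch of that API, assuming a unigram Lidstone-smoothed model over the same Brown news category (the test words are arbitrary):

from nltk.corpus import brown
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

# build a unigram model with Lidstone (additive) smoothing, gamma = 0.2
train_sents = brown.sents(categories='news')
train, vocab = padded_everygram_pipeline(1, train_sents)
lm = Lidstone(0.2, 1)
lm.fit(train, vocab)

# perplexity expects an iterable of n-gram tuples, here 1-tuples (unigrams)
print(lm.perplexity([('the',), ('jury',), ('said',)]))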

Thanks for the code snippet! Shouldn't:

for word in model:
        model[word] = model[word]/float(sum(model.values()))

be rather (so that the sum is computed only once, instead of on every loop iteration):

v = float(sum(model.values()))
for word in model:
        model[word] = model[word]/v

Oh ... I see it was already answered ...
