
Why is the perplexity for a padded vocabulary infinite for an nltk.lm bigram?

I am testing the perplexity measure for a language model on a text:

  train_sentences = nltk.sent_tokenize(train_text)
  test_sentences = nltk.sent_tokenize(test_text)

  train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

  test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

  from nltk.lm.preprocessing import padded_everygram_pipeline
  from nltk.lm import MLE,Laplace
  from nltk.lm import Vocabulary

  vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), unk_cutoff=1)

  n = 2
  print(train_tokenized_text)
  print(len(train_tokenized_text))
  train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)

  # print(list(vocab),"\n >>>>",list(padded_vocab))
  model = MLE(n) # Let's train a bigram (n=2) maximum likelihood estimation model.
  # model.fit(train_data, padded_vocab)
  model.fit(train_data, vocab)

  sentences = test_sentences
  print("len: ",len(sentences))
  print("per all", model.perplexity(test_text)) 

When I use vocab in model.fit(train_data, vocab), the perplexity printed by print("per all", model.perplexity(test_text)) is a number (30.2), but if I use padded_vocab, which has the additional <s> and </s> tokens, it prints inf.

The input to perplexity is text as ngrams, not a list of strings. You can verify this by running:

# test_text is a raw string, so x is a single character and the "ngrams"
# scored below are characters with empty contexts -- not what the model expects
for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])

You should see that the tokens (ngrams) are all wrong.
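For reference, here is a minimal sketch of the intended call, reusing n, test_tokenized_text and model from the snippet above (this assumes the model was fit with padded_vocab; even then, individual values can still be inf if a test ngram was never seen in training, as discussed next):

# build padded everygrams (unigrams + bigrams for n=2) for the *test* sentences
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for i, test in enumerate(test_data):
    # each `test` is the generator of padded ngrams for one test sentence
    print("perplexity, sentence", i, ":", model.perplexity(test))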

You will still get inf perplexity if the words in your test data are out of the vocabulary (of the train data):

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

# a toy corpus (overrides the sentences tokenized above)
train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE,Laplace
from nltk.lm import Vocabulary

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the new tokens added to the vocab (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab) 

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out of vocab test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))
