How to find the perplexity of a bigram if the probability of a given bigram is 0
Given the formula for the perplexity of a bigram model (with probabilities computed using add-1 smoothing), how does one proceed when the probability of one of the bigrams in the sentence to predict is 0?
# just examples, don't mind the counts
corpus_bigram = {'<s> now': 2, 'now is': 1, 'is as': 6, 'as one': 1, 'one mordant': 1, 'mordant </s>': 5}
word_dict = {'<s>': 2, 'now': 1, 'is': 6, 'as': 1, 'one': 1, 'mordant': 5, '</s>': 5}
test_bigram = {'<s> now': 2, 'now <UNK>': 1, '<UNK> as': 6, 'as </s>': 5}
n = 1  # Add one smoothing
probabilities = {}
for bigram in test_bigram:
    if bigram in corpus_bigram:
        value = corpus_bigram[bigram]
        first_word = bigram.split()[0]
        probabilities[bigram] = (value + n) / (word_dict.get(first_word) + (n * len(word_dict)))
    else:
        probabilities[bigram] = 0
If, for instance, the probabilities of the test_bigram come out as
# Again just dummy probability values
probabilities = {'<s> now': 0.35332322, 'now <UNK>': 0, '<UNK> as': 0, 'as </s>': 0.632782318}
perplexity = 1
for key in probabilities:
    # when probabilities[key] == 0 ????
    perplexity = perplexity * (1 / probabilities[key])
N = len(sentence)
perplexity = pow(perplexity, 1 / N)
ZeroDivisionError: division by zero
The common solution is to assign words that don't occur a small probability, e.g. 1/N, with N being the total number of words. So you pretend that a word that didn't occur in your data did occur once; that introduces only a minor error, but it prevents the division by zero.
So in your case, probabilities[bigram] = 1 / <sum of all bigram frequencies>
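As a rough sketch of the suggestion above (not the asker's original code), the zero probabilities can be replaced by that small floor value before the perplexity is computed. The function name `perplexity` and its `floor` parameter are made up for this illustration, the probabilities are the dummy values from the question, and the result is normalised by the number of bigrams scored, which is an assumption since `sentence` is not shown in the question. Summing logarithms instead of multiplying reciprocals is a swapped-in detail that gives the same value while avoiding underflow on long sentences.

```python
import math

def perplexity(probabilities, floor):
    """Perplexity over bigram probabilities, with zeros replaced by `floor`.

    `floor` is the small probability given to unseen bigrams
    (hypothetical parameter name, chosen for this sketch).
    """
    log_sum = 0.0
    for p in probabilities.values():
        # Unseen bigrams (probability 0) fall back to the floor value.
        log_sum += math.log(p if p > 0 else floor)
    # Assumption: normalise by the number of bigrams that were scored.
    n = len(probabilities)
    return math.exp(-log_sum / n)

corpus_bigram = {'<s> now': 2, 'now is': 1, 'is as': 6, 'as one': 1,
                 'one mordant': 1, 'mordant </s>': 5}

# 1 / <sum of all bigram frequencies>, as suggested in the answer.
floor = 1 / sum(corpus_bigram.values())

# Dummy probability values from the question.
probabilities = {'<s> now': 0.35332322, 'now <UNK>': 0,
                 '<UNK> as': 0, 'as </s>': 0.632782318}

pp = perplexity(probabilities, floor)
print(pp)
```

With the floor in place no term is zero, so the `ZeroDivisionError` cannot occur. How large the floor should be is a modelling choice; a smaller value penalises unseen bigrams more heavily and raises the perplexity.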