
How to compute perplexity using KenLM?

Let's say we build a model on this:

$ wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
$ lmplz -o 5 < something.txt > something.arpa

From the perplexity formula (https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf), applying the sum of the inverse log probabilities to get the inner term and then taking the nth root, the perplexity number comes out unusually small:

>>> import math
>>> import kenlm
>>> m = kenlm.Model('something.arpa')

# Sentence seen in data.
>>> s = 'The development of a forward-looking and comprehensive European migration policy,'
>>> list(m.full_scores(s))
[(-0.8502398729324341, 2, False), (-3.0185394287109375, 3, False), (-0.3004383146762848, 4, False), (-1.0249041318893433, 5, False), (-0.6545327305793762, 5, False), (-0.29304179549217224, 5, False), (-0.4497605562210083, 5, False), (-0.49850910902023315, 5, False), (-0.3856896460056305, 5, False), (-0.3572353720664978, 5, False), (-1.7523181438446045, 1, False)]
>>> n = len(s.split())
>>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
>>> math.pow(sum_inv_logs, 1.0/n)
1.2536033936438895

Trying again with a sentence not found in the data:

# Sentence not seen in data.
>>> s = 'The European developement of a forward-looking and comphrensive society is doh.'
>>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
>>> sum_inv_logs
35.59524390101433
>>> n = len(s.split())
>>> math.pow(sum_inv_logs, 1.0/n)
1.383679905428275

And trying again with totally out-of-domain data:

>>> s = """On the evening of 5 May 2017, just before the French Presidential Election on 7 May, it was reported that nine gigabytes of Macron's campaign emails had been anonymously posted to Pastebin, a document-sharing site. In a statement on the same evening, Macron's political movement, En Marche!, said: "The En Marche! Movement has been the victim of a massive and co-ordinated hack this evening which has given rise to the diffusion on social media of various internal information"""
>>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
>>> sum_inv_logs
282.61719834804535
>>> n = len(list(m.full_scores(s)))
>>> n
79
>>> math.pow(sum_inv_logs, 1.0/n)
1.0740582373271952

Although it is expected that the longer sentence has lower perplexity, it's strange that the difference is less than 1.0 and in the range of decimals.

Is the above the right way to compute perplexity with KenLM? If not, does anyone know how to compute perplexity with KenLM through the Python API?

See https://github.com/kpu/kenlm/blob/master/python/kenlm.pyx#L182

import kenlm

model = kenlm.Model("something.arpa")
per = model.perplexity("your text sentence")

print(per)
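Per the kenlm.pyx linked above, perplexity(sentence) normalises the full-sentence log10 score by the number of words plus one (the extra token is </s>), so a manual computation along the following lines should agree with it. A minimal sketch, assuming the something.arpa model built in the question:

import kenlm

m = kenlm.Model('something.arpa')
s = 'The development of a forward-looking and comprehensive European migration policy,'

# model.score(s) returns the total log10 probability of the sentence,
# with <s> and </s> added by default.
log10_prob = m.score(s)

# One score per word plus one for the </s> token.
n = len(s.split()) + 1

manual_ppl = 10.0 ** (-log10_prob / n)
print(manual_ppl, m.perplexity(s))  # the two numbers should match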

The perplexity formula is:

PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = ( ∏_{i=1}^{N} 1 / P(w_i | w_1 ... w_{i-1}) )^(1/N)

But that's taking the raw probability, so in code:

 import math
 import numpy as np
 import kenlm
 m = kenlm.Model('something.arpa')
 s = 'The development of a forward-looking and comprehensive European migration policy,'
 # The scores are log base 10 probabilities, so convert each one back to an
 # inverse probability and take the product:
 product_inv_prob = np.prod([math.pow(10.0, -score) for score, _, _ in m.full_scores(s)])
 n = len(list(m.full_scores(s)))
 perplexity = math.pow(product_inv_prob, 1.0/n)

Or using the log (base 10) prob directly:

 sum_inv_logprob = -1 * sum(score for score, _, _ in m.full_scores(s))
 n = len(list(m.full_scores(s)))
 perplexity = math.pow(10.0, sum_inv_logprob / n)
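One subtlety: len(list(m.full_scores(s))) counts one entry per word plus one for the end-of-sentence token </s>, so it is len(s.split()) + 1 rather than the plain whitespace word count. A quick sanity check, assuming the question's something.arpa model:

 import kenlm
 m = kenlm.Model('something.arpa')
 s = 'The development of a forward-looking and comprehensive European migration policy,'
 # full_scores() yields one (log10 prob, ngram length, OOV flag) tuple per word
 # plus one for </s>, so the count is len(s.split()) + 1.
 assert len(list(m.full_scores(s))) == len(s.split()) + 1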

Source: https://www.mail-archive.com/moses-support@mit.edu/msg15341.html

Just want to comment on alvas's answer that

sum_inv_logprob = sum(score for score, _, _ in m.full_scores(s))

Should actually be:

sum_inv_logprob = -1.0 * sum(score for score, _, _ in m.full_scores(s))

You can simply use:

import numpy as np
import kenlm
m = kenlm.Model('something.arpa')
ppl = m.perplexity('something')

