
NLTK ViterbiParser fails to parse words that are not in the PCFG rules

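import nltk
from nltk.parse import ViterbiParser

def pcfg_chartparser(grammarfile):
    # Read the PCFG rules from a file and build an nltk PCFG object.
    with open(grammarfile) as f:
        grammar = f.read()
    return nltk.PCFG.fromstring(grammar)

grammarp = pcfg_chartparser("wsjp.cfg")

# The two sentences discussed below.
sent = ["turn off the lights", "please turn off the lights"]

VP = ViterbiParser(grammarp)
print(VP)
for w in sent:
    for tree in VP.parse(nltk.word_tokenize(w)):
        print(tree)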

When I run the above code, it produces the following output for the sentence "turn off the lights":

(S (VP (VB turn) (PRT (RP off)) (NP (DT the) (NNS lights)))) (p=2.53851e-14)

However, it raises the following error for the sentence "please turn off the lights":

ValueError: Grammar does not cover some of the input words: u"'please'"

I am building a ViterbiParser by supplying it with a probabilistic context-free grammar. It works well on sentences whose words all appear in the grammar rules, but it fails to parse sentences containing a word that the parser has not seen in the grammar rules. How can I get around this limitation?
I am referring to this assignment.

Firstly, try to use (i) namespaces and (ii) unequivocal variable names, e.g.:

>>> from nltk import PCFG
>>> from nltk.parse import ViterbiParser
>>> import urllib.request
>>> response = urllib.request.urlopen('https://raw.githubusercontent.com/salmanahmad/6.863/master/Labs/Assignment5/Code/wsjp.cfg')
>>> wsjp = response.read().decode('utf8')
>>> grammar = PCFG.fromstring(wsjp)
>>> parser = ViterbiParser(grammar)
>>> list(parser.parse('turn off the lights'.split()))
[ProbabilisticTree('S', [ProbabilisticTree('VP', [ProbabilisticTree('VB', ['turn']) (p=0.002082678), ProbabilisticTree('PRT', [ProbabilisticTree('RP', ['off']) (p=0.1089101771)]) (p=0.10768769667270556), ProbabilisticTree('NP', [ProbabilisticTree('DT', ['the']) (p=0.7396712852), ProbabilisticTree('NNS', ['lights']) (p=4.61672e-05)]) (p=4.4236397464693323e-07)]) (p=1.0999324002161311e-13)]) (p=2.5385077255727538e-14)]

If we check the grammar's coverage of the failing sentence:

>>> grammar.check_coverage('please turn off the lights'.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/nltk/grammar.py", line 631, in check_coverage
    "input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'please'".
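
To list the offending tokens instead of catching the exception, the uncovered words can be collected directly. This is a minimal sketch, assuming a small hand-written toy grammar in place of wsjp.cfg; the helper name `uncovered_words` is hypothetical, and the lookup relies on `productions(rhs=...)` returning the productions whose right-hand side starts with the given token:

```python
from nltk import PCFG

# Toy grammar for illustration only; it stands in for wsjp.cfg.
toy_grammar = PCFG.fromstring("""
S -> VP [1.0]
VP -> VB PRT NP [1.0]
PRT -> RP [1.0]
NP -> DT NNS [1.0]
VB -> 'turn' [1.0]
RP -> 'off' [1.0]
DT -> 'the' [1.0]
NNS -> 'lights' [1.0]
""")

def uncovered_words(grammar, tokens):
    """Return the tokens that no lexical production of `grammar` starts with."""
    return [tok for tok in tokens if not grammar.productions(rhs=tok)]

tokens = 'please turn off the lights'.split()
missing = uncovered_words(toy_grammar, tokens)
print(missing)
```

The resulting list is the input needed for any of the unknown-word strategies: it says which tokens have to be replaced or added before parsing.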

To resolve the unknown-word issue, there are several options:

  • Use wildcard non-terminal nodes to replace the unknown words. Find some way to replace the words that check_coverage() reports as not covered by the grammar with the wildcard, then parse the sentence with the wildcard.

    • This will usually decrease the parser's accuracy unless you have specifically trained the PCFG with a grammar that handles unknown words and the wildcard is a superset of the unknown words.

  • Go back to the grammar production file that you had before learning the PCFG with learn_pcfg.py and add all possible words to the terminal productions.

  • Add the unknown words to your PCFG grammar and then renormalize the weights, giving very small weights to the unknown words (you can also try smarter smoothing/interpolation techniques).
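
The wildcard option can be sketched on a toy grammar. Everything specific here is an assumption made for illustration: the grammar is hand-written rather than wsjp.cfg, `'UNK'` is an arbitrary wildcard token, the tags allowed to produce it and their probabilities are invented, and `replace_unknown` is a hypothetical helper name.

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

# Toy grammar for illustration only (not wsjp.cfg). 'UNK' is an assumed
# wildcard terminal; which tags may produce it is an arbitrary choice.
toy_grammar = PCFG.fromstring("""
S -> VP [0.5] | UH VP [0.5]
VP -> VB PRT NP [1.0]
PRT -> RP [1.0]
NP -> DT NNS [1.0]
UH -> 'UNK' [1.0]
VB -> 'turn' [0.9] | 'UNK' [0.1]
RP -> 'off' [1.0]
DT -> 'the' [1.0]
NNS -> 'lights' [0.9] | 'UNK' [0.1]
""")

def replace_unknown(grammar, tokens, wildcard='UNK'):
    """Swap every token the grammar has no lexical production for with `wildcard`."""
    return [tok if grammar.productions(rhs=tok) else wildcard for tok in tokens]

tokens = 'please turn off the lights'.split()
replaced = replace_unknown(toy_grammar, tokens)

# The sentence now contains only covered terminals, so parsing succeeds.
parser = ViterbiParser(toy_grammar)
trees = list(parser.parse(replaced))
for tree in trees:
    print(tree)
```

The parse tree contains the wildcard rather than the original word, so the original tokens have to be kept alongside it if they are needed afterwards.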

Since this is a homework question, I will not give the answer with the full code. But the hints above should be enough to resolve the problem.
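
For the renormalization hint, the mechanics can be illustrated on a toy grammar without touching the assignment's own grammar. This is a rough sketch under stated assumptions: the grammar is hand-written, the choice of the `UH` tag for "please" and the weight `0.01` are arbitrary, and `add_word` is a hypothetical helper. It scales the existing productions of one left-hand side by `1 - weight` so that, together with the new word, they still sum to 1.

```python
from nltk import PCFG
from nltk.grammar import Nonterminal, ProbabilisticProduction
from nltk.parse import ViterbiParser

# Toy grammar for illustration only (not wsjp.cfg).
toy_grammar = PCFG.fromstring("""
S -> VP [0.9] | UH VP [0.1]
VP -> VB PRT NP [1.0]
PRT -> RP [1.0]
NP -> DT NNS [1.0]
UH -> 'oh' [1.0]
VB -> 'turn' [1.0]
RP -> 'off' [1.0]
DT -> 'the' [1.0]
NNS -> 'lights' [1.0]
""")

def add_word(grammar, lhs_symbol, word, weight=0.01):
    """Return a new PCFG in which `lhs_symbol` also produces `word`.

    Existing productions of that left-hand side are scaled by (1 - weight),
    so the left-hand side must already have productions summing to 1.
    """
    lhs = Nonterminal(lhs_symbol)
    productions = []
    for prod in grammar.productions():
        if prod.lhs() == lhs:
            # Rescale so this left-hand side still sums to 1 after the addition.
            prod = ProbabilisticProduction(
                prod.lhs(), prod.rhs(), prob=prod.prob() * (1 - weight))
        productions.append(prod)
    productions.append(ProbabilisticProduction(lhs, [word], prob=weight))
    return PCFG(grammar.start(), productions)

# Assumption: 'please' is treated as an interjection (UH) in this toy setup.
extended = add_word(toy_grammar, 'UH', 'please')

tokens = 'please turn off the lights'.split()
trees = list(ViterbiParser(extended).parse(tokens))
for tree in trees:
    print(tree)
```

A fixed small weight is the crudest possible choice; the smoothing and interpolation techniques mentioned in the hints would replace the constant with an estimate derived from the training data.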
