How do I parse special characters in a context-free grammar?

Question

I have a context-free grammar (CFG) that involves punctuation, e.g. nltk.parse_cfg("""PP-CLR -> IN `` NP-TTL""")

The `` is a valid Penn Treebank POS tag, but nltk cannot recognize it. In fact, nltk.parse_cfg cannot recognize any character other than alphanumerics and dashes, while the Penn Treebank POS tag set includes several punctuation tags, such as $ # : . (

Should I keep the punctuation in my dataset, then? Or is there any way to parse these characters?

Thanks

Answer 1

You might need to specify them explicitly as terminal nodes (quoted strings), e.g.:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... VP -> V PUNCT
... PUNCT -> '.'
... V -> 'eat'
... NP -> 'I'
... """)
>>> 
>>> sentence = "I eat .".split()
>>> cp = nltk.ChartParser(grammar)
>>> for tree in cp.nbest_parse(sentence):
...     print tree
... 
(S (NP I) (VP (V eat) (PUNCT .)))
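
The session above uses the older nltk.parse_cfg and nbest_parse calls. A minimal sketch of the same idea for NLTK 3, assuming nltk.CFG.fromstring and ChartParser.parse are the current equivalents, is given here. The extra quoted terminals '``' and '$' are added only for illustration and are not part of the original answer.

```python
import nltk

# Punctuation is written as quoted terminals, so the grammar reader
# never has to treat it as part of a nonterminal name.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V PUNCT
PUNCT -> '.' | '``' | '$'
V -> 'eat'
NP -> 'I'
""")

parser = nltk.ChartParser(grammar)

# parse() yields the trees that cover the whole token sequence.
trees = list(parser.parse("I eat .".split()))
for tree in trees:
    print(tree)
```

This only helps when the punctuation appears on the right-hand side as a terminal. A nonterminal whose name itself contains punctuation still cannot be written in the grammar string.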

Answer 2

For people using the current generation of NLTK, you can add nonterminals that include special characters by manually updating the grammar object's set of productions. Below, I add the tag/nonterminal PRP$, which contains the special character $:

from nltk.grammar import Nonterminal, Production

# my_grammar is an existing grammar object
productions = my_grammar.productions()
# Add the rule Nom -> PRP$ by constructing the symbols directly
productions.extend([Production(Nonterminal('Nom'), [Nonterminal('PRP$')])])

This is equivalent to adding the following to our CFG:

Nom -> PRP$

Using nltk.CFG.fromstring("Nom -> PRP$") instead throws an error.
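
One caveat, which is my own assumption about NLTK internals and not something stated in the answer: a CFG object appears to build its lookup indexes when it is constructed, so productions appended to the list afterwards may not be seen by a parser. A more conservative sketch builds a new CFG from the combined production list. The toy grammar and the words 'the', 'my' and 'dog' are hypothetical and serve only as an illustration.

```python
from nltk import ChartParser
from nltk.grammar import CFG, Nonterminal, Production

# Hypothetical base grammar that uses only symbols the string reader accepts.
base = CFG.fromstring("""
NP -> Nom N
Nom -> 'the'
N -> 'dog'
""")

# Rules whose symbols contain '$' are constructed by hand.
prp = Nonterminal('PRP$')
extra = [
    Production(Nonterminal('Nom'), [prp]),  # Nom -> PRP$
    Production(prp, ['my']),                # PRP$ -> 'my'
]

# Build a fresh CFG so its indexes are computed over all productions.
grammar = CFG(base.start(), list(base.productions()) + extra)

trees = list(ChartParser(grammar).parse(['my', 'dog']))
for tree in trees:
    print(tree)
```

Constructing a new grammar leaves the original object untouched, which also makes it easier to compare the two.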

2 answers

Answer 1
1, accepted, 2013-09-25 09:09:44

Answer 2
0, 2021-10-10 01:25:22
