
Horizontal and Vertical Markovization

I have a sentence along with its grammar in tree form. I need to train a Probabilistic Context-Free Grammar (PCFG) from it so that I can produce the best possible parse, and I am using the Viterbi CKY algorithm to find that parse. The sentences are in the following tree format:

(TOP (S (NP (DT The) (NN flight)) (VP (MD should) (VP (VB be) (NP (NP (CD eleven) (RB am)) (NP (NN tomorrow)))))) (PUNC .))

I have built a system which, from the ATIS section of the Penn Treebank, has learnt a probabilistic grammar and can now produce a possible parse for the above sentence.

I read about Horizontal and Vertical Markovization, techniques which can help increase accuracy by annotating the grammar. I am a little confused as to how they work. Can someone point me to some explanatory examples, or illustrate how they work and how they affect accuracy?

It is worth looking at this paper by Klein and Manning:

http://nlp.stanford.edu/~manning/papers/unlexicalized-parsing.pdf

Vertical Markovization is a technique that provides context for a given rule. From the above paper:

For example, subject NP expansions are very different from object NP expansions: a subject NP is 8.7 times more likely than an object NP to expand as just a pronoun. Having separate symbols for subject and object NPs allows this variation to be captured and used to improve parse scoring. One way of capturing this kind of external context is to use parent annotation, as presented in Johnson (1998). For example, NPs with S parents (like subjects) will be marked NP^S, while NPs with VP parents (like objects) will be NP^VP.

By rewriting the rules with this additional parent annotation, we add information about where in the tree each rule applies, and that extra context gives a more accurate probability for each particular rewrite.

The implementation is quite simple. Using the training data, start at the lowest non-terminals (the pre-terminals that rewrite to terminals, such as DT, NNP, NN, VB, etc.) and append a ^ followed by the parent non-terminal. In your example, the first rewrite would be NP^S, and so on. Continue up the tree until you reach TOP (which you would not rewrite); in your case, the final rewrite would be S^TOP. Stripping the annotations from your output gives you the final parse tree.
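As a rough illustration, here is a minimal sketch of that transform, assuming NLTK's Tree class is available (the function name parent_annotate is just a label for this example, not a standard API):

```python
from nltk.tree import Tree

def parent_annotate(tree, parent=None):
    """Append ^PARENT to every non-terminal label (second-order vertical
    Markovization). TOP has no parent, so it is left unannotated, and
    terminal strings (the words) are untouched."""
    if isinstance(tree, str):          # a word: nothing to annotate
        return tree
    label = tree.label()
    new_label = label if parent is None else label + "^" + parent
    # Children see the *unannotated* label, so NP under S becomes NP^S
    # rather than NP^S^TOP.
    return Tree(new_label, [parent_annotate(child, label) for child in tree])

t = Tree.fromstring(
    "(TOP (S (NP (DT The) (NN flight)) (VP (MD should) (VP (VB be) "
    "(NP (NP (CD eleven) (RB am)) (NP (NN tomorrow)))))) (PUNC .))")
print(parent_annotate(t))
# (TOP (S^TOP (NP^S (DT^NP The) (NN^NP flight)) (VP^S ...)) (PUNC^TOP .))
```

Threading more ancestor history through the recursion would give higher-order vertical Markovization; the sketch above corresponds to v = 2 (one parent of context).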

As for Horizontal Markovization, see this thread for a nice discussion: Horizontal Markovization.
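In brief, horizontal Markovization controls how much sibling history survives when flat n-ary rules are binarized. Here is a sketch under the same NLTK assumption (the binarize name and the X|<...> label convention are illustrative; NLTK also ships this transform as nltk.treetransforms.chomsky_normal_form, with horzMarkov and vertMarkov arguments):

```python
from nltk.tree import Tree

def binarize(tree, h=1):
    """Right-binarize a tree; each intermediate symbol remembers at most
    the last h sibling categories already generated (h is the horizontal
    Markov order). Assumes Penn Treebank-style trees, where words appear
    only under pre-terminals."""
    if isinstance(tree, str):            # a word: leave as-is
        return tree
    kids = [binarize(k, h) for k in tree]
    if len(kids) <= 2:                   # unary/binary rules need no work
        return Tree(tree.label(), kids)
    parent = tree.label()
    node = kids[-1]                      # fold the children in from the right
    for i in range(len(kids) - 2, 0, -1):
        history = "-".join(k.label() for k in kids[max(0, i - h):i])
        node = Tree(parent + "|<" + history + ">", [kids[i], node])
    return Tree(parent, [kids[0], node])
```

With h = 1, a flat rule NP -> DT JJ JJ NN becomes NP -> DT NP|<DT>, NP|<DT> -> JJ NP|<JJ>, and NP|<JJ> -> JJ NN: each intermediate symbol forgets everything but the most recent sibling, which merges counts across rare long rules, while an unbounded h would reproduce the original unbinarized statistics.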
