
Efficient Context-Free Grammar parser, preferably Python-friendly

I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently.

Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical entries have little to no effect on performance.
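For readers unfamiliar with the setup being described, here is a minimal sketch of NLTK feature-grammar parsing. The toy grammar below (singular/plural agreement via a NUM feature) is my own stand-in for illustration, not the asker's 450-rule grammar:

```python
import nltk

# A toy feature grammar with 1-level feature structures (NUM agreement),
# standing in for the much larger grammar described above.
grammar = nltk.grammar.FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n]
Det[NUM=sg] -> 'this'
Det[NUM=pl] -> 'these'
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
V[NUM=sg] -> 'barks'
V[NUM=pl] -> 'bark'
""")

parser = nltk.parse.FeatureChartParser(grammar)

# Agreement succeeds: exactly one tree.
trees = list(parser.parse("these dogs bark".split()))
print(len(trees))

# NUM mismatch between 'this' and 'dogs': no parse.
bad = list(parser.parse("this dogs bark".split()))
print(len(bad))
```

On a grammar this small parsing is instant; the slowdown described in the question comes from ambiguity in a much larger rule set.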

Another problem is that loading the (25 MB) grammar + lexicon at startup can take up to a minute.

From what I can find in the literature, the running time of the algorithms used to parse such a grammar (Earley or CKY) should be linear in the size of the grammar and cubic in the length of the input token list. My experience with NLTK indicates that ambiguity is what hurts performance most, not the absolute size of the grammar.
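To make the complexity claim concrete, here is a minimal CKY recognizer (my own toy sketch, for a plain CNF grammar without feature structures). The three nested span loops give the cubic factor in sentence length; the rule lookups inside are what scale with grammar size:

```python
from collections import defaultdict

def cyk_recognize(tokens, lexical, binary, start="S"):
    """CKY recognizer for a grammar in Chomsky normal form.

    `lexical` maps a terminal to the nonterminals that produce it;
    `binary` maps a pair (B, C) to the nonterminals A with A -> B C.
    """
    n = len(tokens)
    chart = defaultdict(set)  # (i, j) -> nonterminals spanning tokens[i:j]
    for i, tok in enumerate(tokens):
        chart[i, i + 1] = set(lexical.get(tok, ()))
    for width in range(2, n + 1):          # span length
        for i in range(n - width + 1):     # span start
            j = i + width
            for k in range(i + 1, j):      # split point -> cubic in n
                for b in chart[i, k]:
                    for c in chart[k, j]:
                        chart[i, j] |= set(binary.get((b, c), ()))
    return start in chart[0, n]

# Tiny illustrative grammar (hypothetical, not from the question).
lexical = {"dogs": ["N", "NP"], "bark": ["V", "VP"], "the": ["Det"]}
binary = {("Det", "N"): ["NP"], ("NP", "VP"): ["S"]}
print(cyk_recognize("the dogs bark".split(), lexical, binary))  # True
```

A recognizer like this stays cubic regardless of ambiguity; it is when you enumerate all parse trees (as NLTK does) that an ambiguous grammar can blow up, which matches the observation above.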

So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY, but I can't tell whether it supports feature structures in CFGs, which I need, and the examples I've seen seem to do a lot of procedural parsing rather than just specifying a grammar. Can anybody show me a PLY example that both supports feature structures and uses a declarative grammar?

I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.

By all means take a look at Pyparsing. It's the most Pythonic implementation of parsing I've come across, and it's a great design from a purely academic standpoint.

I used both ANTLR and JavaCC to teach translator and compiler theory at a local university. They're both good and mature, but I wouldn't use them in a Python project.

That said, unlike programming languages, natural languages are much more about semantics than syntax, so you could be much better off skipping the learning curves of existing parsing tools, going with a home-brewed (top-down, backtracking, unlimited-lookahead) lexical analyzer and parser, and spending the bulk of your time writing the code that figures out what a parsed, but ambiguous, natural-language sentence means.

Tooling aside...

You may remember from theory that there are infinitely many grammars that define the same language. There are criteria for categorizing grammars and determining which is the "canonical" or "minimal" one for a given language, but in the end, the "best" grammar is the one that's most convenient for the task and tools at hand (remember the transformations of CFGs into LL and LR grammars?).

Also, you probably don't need a huge lexicon to parse a sentence in English. There's a lot to be known about a word in languages like German or Latin (or even Spanish), but not in the often arbitrary and ambiguous English. You should be able to get away with a small lexicon that contains only the key words needed to arrive at the structure of a sentence. In any case, the grammar you choose, no matter its size, can be cached in a way that the tooling can use directly (i.e., you can skip parsing the grammar).
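The caching idea also addresses the one-minute load time mentioned in the question. One simple approach (a sketch under the assumption that your toolkit's grammar objects pickle cleanly, which is worth verifying) is to parse the grammar text once and pickle the resulting object:

```python
import os
import pickle

def load_grammar_cached(source_path, cache_path, parse_fn):
    """Parse `source_path` once with `parse_fn` and cache the result.

    On later runs (as long as the cache is newer than the source) the
    pickled object is loaded directly, skipping the slow text-to-grammar
    parse.  `parse_fn` is whatever your toolkit uses to build a grammar
    object from text, e.g. a fromstring-style function.
    """
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    with open(source_path) as f:
        grammar = parse_fn(f.read())
    with open(cache_path, "wb") as f:
        pickle.dump(grammar, f, protocol=pickle.HIGHEST_PROTOCOL)
    return grammar
```

With NLTK, `parse_fn` could be `nltk.grammar.FeatureGrammar.fromstring`; unpickling a 25 MB object should be far faster than re-parsing the grammar text.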

Given that, it could be a good idea to take a look at a simpler parser someone else has already worked on. There must be thousands of those in the literature. Studying different approaches will let you evaluate your own, and may lead you to adopt someone else's.

Finally, as I already mentioned, interpreting natural languages is much more about artificial intelligence than about parsing. Because structure determines meaning and meaning determines structure, you have to play with both at the same time. An approach I've seen in the literature since the '80s is to let different specialized agents take shots at solving the problem against a "blackboard". With that approach, syntactic and semantic analysis happen concurrently.

I would recommend bitpar, a very efficient PCFG parser written in C++. I've written a shell-based Python wrapper for it; see https://github.com/andreasvc/eodop/blob/master/bitpar.py

I've used pyparsing for limited-vocabulary command parsing, but here is a little framework on top of pyparsing that addresses your posted example:

from pyparsing import *

transVerb, transVerbPlural, transVerbPast, transVerbProg = (Forward() for i in range(4))
intransVerb, intransVerbPlural, intransVerbPast, intransVerbProg = (Forward() for i in range(4))
singNoun, pluralNoun, properNoun = (Forward() for i in range(3))
singArticle, pluralArticle = (Forward() for i in range(2))
verbProg = transVerbProg | intransVerbProg
verbPlural = transVerbPlural | intransVerbPlural

for expr in (transVerb, transVerbPlural, transVerbPast, transVerbProg,
            intransVerb, intransVerbPlural, intransVerbPast, intransVerbProg,
            singNoun, pluralNoun, properNoun, singArticle, pluralArticle):
    expr <<= MatchFirst([])

def appendExpr(e1, s):
    # Add a case-insensitive-first-letter word match as a new
    # alternative to the MatchFirst wrapped by the Forward.
    c1 = s[0]
    e2 = Regex(r"[%s%s]%s\b" % (c1.upper(), c1.lower(), s[1:]))
    e1.expr.exprs.append(e2)

def makeVerb(s, transitive):
    v_pl, v_sg, v_past, v_prog = s.split()
    if transitive:
        appendExpr(transVerb, v_sg)
        appendExpr(transVerbPlural, v_pl)
        appendExpr(transVerbPast, v_past)
        appendExpr(transVerbProg, v_prog)
    else:
        appendExpr(intransVerb, v_sg)
        appendExpr(intransVerbPlural, v_pl)
        appendExpr(intransVerbPast, v_past)
        appendExpr(intransVerbProg, v_prog)

def makeNoun(s, proper=False):
    if proper:
        appendExpr(properNoun, s)
    else:
        n_sg, n_pl = (s.split() + [s + "s"])[:2]
        appendExpr(singNoun, n_sg)
        appendExpr(pluralNoun, n_pl)

def makeArticle(s, plural=False):
    for ss in s.split():
        if not plural:
            appendExpr(singArticle, ss)
        else:
            appendExpr(pluralArticle, ss)

makeVerb("disappear disappears disappeared disappearing", transitive=False)
makeVerb("walk walks walked walking", transitive=False)
makeVerb("see sees saw seeing", transitive=True)
makeVerb("like likes liked liking", transitive=True)

makeNoun("dog")
makeNoun("girl")
makeNoun("car")
makeNoun("child children")
makeNoun("Kim", proper=True)
makeNoun("Jody", proper=True)

makeArticle("a the")
makeArticle("this every")
makeArticle("the these all some several", plural=True)

transObject = (singArticle + singNoun | properNoun | Optional(pluralArticle) + pluralNoun | verbProg | "to" + verbPlural)
sgSentence = (singArticle + singNoun | properNoun) + (intransVerb | intransVerbPast | (transVerb | transVerbPast) + transObject)
plSentence = (Optional(pluralArticle) + pluralNoun) + (intransVerbPlural | intransVerbPast | (transVerbPlural | transVerbPast) + transObject)

sentence = sgSentence | plSentence


def test(s):
    print(s)
    try:
        print(sentence.parseString(s).asList())
    except ParseException as pe:
        print(pe)

test("Kim likes cars")
test("The girl saw the dog")
test("The dog saw Jody")
test("Kim likes walking")
test("Every girl likes dogs")
test("All dogs like children")
test("Jody likes to walk")
test("Dogs like walking")
test("All dogs like walking")
test("Every child likes Jody")

Prints:

Kim likes cars
['Kim', 'likes', 'cars']
The girl saw the dog
['The', 'girl', 'saw', 'the', 'dog']
The dog saw Jody
['The', 'dog', 'saw', 'Jody']
Kim likes walking
['Kim', 'likes', 'walking']
Every girl likes dogs
['Every', 'girl', 'likes', 'dogs']
All dogs like children
['All', 'dogs', 'like', 'children']
Jody likes to walk
['Jody', 'likes', 'to', 'walk']
Dogs like walking
['Dogs', 'like', 'walking']
All dogs like walking
['All', 'dogs', 'like', 'walking']
Every child likes Jody
['Every', 'child', 'likes', 'Jody']

This is likely to get slow as you expand the vocabulary. Half a million entries? I thought a reasonable functional vocabulary was on the order of 5-6 thousand words. And you will be pretty limited in the sentence structures you can handle - natural language is what NLTK is for.

Somewhat late on this, but here are two more options for you:

Spark is an Earley parser written in Python.

Elkhound is a GLR parser written in C++; it uses a Bison-like syntax.

I think ANTLR is the best parser generator I know of for Java. I don't know whether Jython would give you a good way for Python and Java to interact.

If it can be expressed as a PEG language (I don't think all CFGs can, but supposedly many can), then you might use pyPEG, which is supposed to run in linear time when using a packrat parsing implementation (although potentially prohibitive in memory usage).
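To make the linear-time claim concrete, here is a tiny packrat-style recognizer (my own toy illustration, not pyPEG's actual implementation) for the PEG `S <- 'a' S 'b' / ''`, which matches a^n b^n:

```python
from functools import lru_cache

def make_recognizer(text):
    """Packrat-style recognizer for the toy PEG  S <- 'a' S 'b' / ''

    Returns a function S(pos) giving the end position of the PEG match
    starting at pos.  Memoizing each (rule, position) pair means every
    position is computed at most once per rule - the linear-time
    guarantee - at the cost of O(rules x positions) memory.
    """
    @lru_cache(maxsize=None)
    def S(pos):
        # First alternative: 'a' S 'b'
        if pos < len(text) and text[pos] == "a":
            mid = S(pos + 1)
            if mid < len(text) and text[mid] == "b":
                return mid + 1
        # Second alternative: the empty string always matches
        return pos
    return S

S = make_recognizer("aaabbb")
print(S(0) == len("aaabbb"))  # full match -> True
```

The memoization table is also where the memory cost mentioned above comes from: for a large grammar over a long input, caching every (rule, position) result can add up quickly.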

I don't have any experience with it, as I'm just starting to research parsing and compilation again after a long time away from them, but I'm reading some good buzz about this relatively up-to-date technique. YMMV.

Try running it on PyPy; it might be much faster.
