What is the runtime difference between different parsing algorithms?

There are lots of different parsing algorithms out there (recursive descent, LL(k), LR(k), LALR, ...). I can find a lot of information about the different grammars the various types of parsers can accept. But how do they differ in runtime behavior? Which algorithm is faster, or uses less memory or stack space?

Or to put this differently - which algorithm performs best, assuming the grammar can be formulated to work with any algorithm?

LR parsers IMHO can be the fastest. Basically they use a token as an index into a lookahead set or a transition table to decide what to do next (push a state index, or pop state indexes and call a reduction routine). Converted to machine code this can be just a few machine instructions. Pennello discusses this in detail in his paper:

Thomas J. Pennello: Very fast LR parsing. SIGPLAN Symposium on Compiler Construction 1986: 145-151.
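As a rough illustration of why that loop is so cheap, here is a minimal sketch of a table-driven LR parse loop for the toy grammar E -> E '+' n | n. The action/goto tables, the token codes, and the shift/reduce encoding below are all invented for this sketch and are far smaller than anything a real generator emits:

```cpp
#include <cstdio>
#include <vector>

// Token codes (the columns of the action table) -- invented for this sketch.
enum Tok { NUM = 0, PLUS = 1, END = 2 };

// Action encoding (also invented): positive = shift to that state,
// negative = reduce by that rule number, 0 = error, ACCEPT = accept.
const int ACCEPT = 99;
const int action[5][3] = {
    /* state 0 */ { +2,  0,      0 },   // shift 2 on NUM
    /* state 1 */ {  0, +3, ACCEPT },   // shift 3 on PLUS, accept on END
    /* state 2 */ {  0, -2,     -2 },   // reduce by rule 2: E -> n
    /* state 3 */ { +4,  0,      0 },   // shift 4 on NUM
    /* state 4 */ {  0, -1,     -1 },   // reduce by rule 1: E -> E '+' n
};
const int goto_E[5]  = { 1, 0, 0, 0, 0 };  // goto on E (the only nonterminal)
const int rhs_len[3] = { 0, 3, 1 };        // states popped by each reduction

// Recognize a token stream for the grammar  E -> E '+' n | n .
bool parse(const std::vector<Tok>& input) {
    std::vector<int> stack = { 0 };        // state stack, start in state 0
    size_t pos = 0;
    for (;;) {
        int a = action[stack.back()][input[pos]];
        if (a == ACCEPT) return true;      // end-of-input seen in state 1
        if (a > 0) {                       // shift: push the new state, advance
            stack.push_back(a);
            ++pos;
        } else if (a < 0) {                // reduce: pop |rhs| states, then goto
            stack.resize(stack.size() - rhs_len[-a]);
            stack.push_back(goto_E[stack.back()]);
        } else {
            return false;                  // error entry
        }
    }
}

int main() {
    // "n + n + n" followed by the end-of-input marker.
    std::vector<Tok> toks = { NUM, PLUS, NUM, PLUS, NUM, END };
    std::printf("%s\n", parse(toks) ? "accepted" : "rejected");
}
```

Each step of the loop is one table load plus a push or a pop, which is what makes the technique amenable to the machine-code specialization Pennello describes.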

LL parsers involve recursive calls, which are a bit slower than just plain table lookups, but they can be pretty fast.
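For contrast with the table-driven loop above, here is a minimal hand-written recursive-descent recognizer for a toy grammar (the grammar and names are invented for illustration); every parsing decision costs a function call and return on top of the token test:

```cpp
#include <cstdio>
#include <string>

// Hand-written recursive-descent recognizer for the toy grammar
//   E -> T ('+' T)*        T -> 'n' | '(' E ')'
// Each nonterminal becomes a function, so every parsing decision involves a
// call and a return in addition to the character comparison.
struct Parser {
    std::string in;
    size_t pos = 0;

    bool eat(char c) {                    // consume c if it is the next character
        if (pos < in.size() && in[pos] == c) { ++pos; return true; }
        return false;
    }
    bool parseT() {                       // T -> 'n' | '(' E ')'
        if (eat('n')) return true;
        return eat('(') && parseE() && eat(')');
    }
    bool parseE() {                       // E -> T ('+' T)*
        if (!parseT()) return false;
        while (eat('+'))
            if (!parseT()) return false;
        return true;
    }
    bool parse() { return parseE() && pos == in.size(); }
};

int main() {
    Parser p;
    p.in = "n+(n+n)";
    std::printf("%s\n", p.parse() ? "accepted" : "rejected");
}
```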

GLR parsers are generalizations of LR parsers, and thus have to be slower than LR parsers. A key observation is that most of the time a GLR parser is acting exactly as an LR parser would, and one can make that part run at essentially the same speed as an LR parser, so they can be fairly fast.

Your parser is likely to spend more time breaking the input stream into tokens than executing the parsing algorithm, so these differences may not matter a lot.

In terms of getting your grammar into a usable form, the following is the order in which the parsing technologies "make it easy":

  • GLR (really easy: if you can write grammar rules, you can parse)
  • LR(k) (many grammars fit, extremely few parser generators)
  • LR(1) (most commonly available [YACC, Bison, Gold, ...])
  • LL (usually requires significant reengineering of the grammar to remove left recursion; see the sketch after this list)
  • Hand-coded recursive descent (easy to code for simple grammars; difficult to handle complex grammars and difficult to maintain if the grammar changes a lot)
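The left-recursion removal mentioned in the LL bullet is the standard textbook rewrite. A minimal sketch on a toy grammar (names invented for illustration):

```cpp
#include <cstdio>
#include <string>

// The "reengineering" the LL bullet refers to: the natural left-recursive rule
//     E -> E '+' 'n' | 'n'
// makes an LL/recursive-descent parser call parseE() before consuming anything,
// so it never terminates.  The standard rewrite introduces a tail nonterminal:
//     E  -> 'n' E'
//     E' -> '+' 'n' E' | <empty>
// which maps directly onto the two functions below.
struct LLParser {
    std::string in;
    size_t pos = 0;

    bool eat(char c) {
        if (pos < in.size() && in[pos] == c) { ++pos; return true; }
        return false;
    }
    bool parseE()     { return eat('n') && parseETail(); }   // E  -> 'n' E'
    bool parseETail() {                                      // E' -> '+' 'n' E' | <empty>
        if (eat('+')) return eat('n') && parseETail();
        return true;                                         // the <empty> alternative
    }
    bool parse() { return parseE() && pos == in.size(); }
};

int main() {
    LLParser p;
    p.in = "n+n+n";
    std::printf("%s\n", p.parse() ? "accepted" : "rejected");
}
```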

I did a study of LR parser speed, comparing LRSTAR and YACC.

In 1989 I compared the matrix parser tables defined in the paper "Optimization of Parser Tables for Portable Compilers" to the YACC parser tables (comb structure). These are both LR or LALR parser tables. I found that the matrix parser tables were usually two times the speed of the comb parser tables. This is because the number of nonterminal transitions (goto actions) is usually about twice the number of terminal transitions, and the matrix tables have a faster nonterminal transition. However, there are many other things going on in a parser besides the state transitions, so this may not be the bottleneck.
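Roughly what the two goto lookups being compared look like, with tiny invented tables; the comb-style arrays below only approximate the packed base/check/default layout a YACC-like generator emits, not its exact internals:

```cpp
#include <cstdio>

// Two encodings of the same nonterminal-transition (goto) function.

// (a) Matrix form: a full 2-D table, so a goto is one indexed load.
const int goto_matrix[3][2] = {        // [state][nonterminal]
    { 1, 2 },
    { 1, 0 },
    { 0, 2 },
};
int matrix_goto(int state, int nt) {
    return goto_matrix[state][nt];     // row base + column offset, one load
}

// (b) Comb form: rows overlapped into one packed array plus a check array and
//     per-nonterminal defaults.  Smaller, but each goto costs extra address
//     arithmetic, a comparison, and sometimes a second (default) lookup.
const int base_[2]    = { 0, 2 };      // where each nonterminal's row starts
const int packed[5]   = { -1, -1, 0, 0, -1 };   // only valid where check matches
const int check_[5]   = { -1, -1, 2, 1, -1 };
const int default_[2] = { 1, 2 };      // most common target per nonterminal
int comb_goto(int state, int nt) {
    int i = base_[nt] + state;
    return (check_[i] == state) ? packed[i] : default_[nt];
}

int main() {
    // Both encodings describe the same transition function.
    for (int s = 0; s < 3; ++s)
        for (int nt = 0; nt < 2; ++nt)
            std::printf("state %d, nonterminal %d: matrix=%d comb=%d\n",
                        s, nt, matrix_goto(s, nt), comb_goto(s, nt));
}
```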

In 2009 I compared the matrix lexer tables to the flex-generated lexer tables and also to the direct-code lexers generated by re2c. I found that the matrix tables were about two times the speed of the flex-generated tables and almost as fast as the re2c lexer code. The benefit of the matrix tables is that they compile much quicker than the direct-code tables and they are smaller. And finally, if you allow the matrix tables to be very large (with no compression) they can actually be faster than the direct-code (re2c) tables. For a graph showing the comparison see the LRSTAR comparison page.

Compiler front-ends (without preprocessing) built with LRSTAR are processing about 2,400,000 lines of code per second, and this includes building a symbol table and abstract syntax tree while parsing and lexing. The lexers built with a DFA are processing 30,000,000 tokens per second. There is another advantage to matrix table-driven lexers when using a DFA: the lexer skeleton can be rewritten in assembly language. When I did this in 1986, the speed of the lexer was two times the speed of the C code version.
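A minimal sketch of what such a matrix table-driven DFA skeleton boils down to; the character classes, states, and toy token set below are invented for illustration. The per-character work is essentially one classification plus one 2-D table load, which is why the inner loop compiles down to a handful of instructions and is easy to hand-tune in assembly:

```cpp
#include <cstdio>

// Minimal matrix table-driven DFA recognizing identifiers and integers.
// Real generated lexers have many more states and character classes, but the
// inner loop has exactly this shape: classify, index the matrix, repeat.
enum Cls   { LETTER, DIGIT, OTHER, NCLS };
enum State { START, IDENT, NUMBER, STOP, NSTATES };

// next_[state][character class]: the whole lexer is this one 2-D matrix.
const State next_[NSTATES][NCLS] = {
    /* START  */ { IDENT,  NUMBER, STOP },
    /* IDENT  */ { IDENT,  IDENT,  STOP },   // letters and digits extend an identifier
    /* NUMBER */ { STOP,   NUMBER, STOP },   // only digits extend a number
    /* STOP   */ { STOP,   STOP,   STOP },
};

Cls classify(char c) {
    if (c >= '0' && c <= '9') return DIGIT;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return LETTER;
    return OTHER;
}

// Scan one token starting at p; returns its length and the final state.
int scan(const char* p, State* out) {
    State s = START;
    int n = 0;
    while (p[n]) {
        State t = next_[s][classify(p[n])];  // the entire per-character cost
        if (t == STOP) break;
        s = t;
        ++n;
    }
    *out = s;
    return n;
}

int main() {
    const char* input = "count42 123";
    State s;
    int len = scan(input, &s);
    std::printf("token \"%.*s\" (%s)\n", len, input,
                s == IDENT ? "identifier" : s == NUMBER ? "number" : "none");
}
```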

I don't have much experience with LL parser speed or recursive descent parser speed. Sorry. If ANTLR could generate C++ code, then I could do a speed test for its parsers.
