
What is the difference between LL and LR parsing?

Can anyone give me a simple example of LL parsing versus LR parsing?

At a high level, the difference between LL parsing and LR parsing is that LL parsers begin at the start symbol and try to apply productions to arrive at the target string, whereas LR parsers begin at the target string and try to arrive back at the start symbol.

An LL parse is a left-to-right, leftmost derivation. That is, we consider the input symbols from the left to the right and attempt to construct a leftmost derivation. This is done by beginning at the start symbol and repeatedly expanding out the leftmost nonterminal until we arrive at the target string. An LR parse is a left-to-right, rightmost derivation, meaning that we scan from the left to right and attempt to construct a rightmost derivation. The parser continuously picks a substring of the input and attempts to reverse it back to a nonterminal.

During an LL parse, the parser continuously chooses between two actions:

  1. Predict : Based on the leftmost nonterminal and some number of lookahead tokens, choose which production ought to be applied to get closer to the input string.
  2. Match : Match the leftmost guessed terminal symbol with the leftmost unconsumed symbol of input.

As an example, given this grammar:

  • S → E
  • E → T + E
  • E → T
  • T → int
Then given the string int + int + int, an LL(2) parser (which uses two tokens of lookahead) would parse the string as follows:

Production       Input              Action
---------------------------------------------------------
S                int + int + int    Predict S -> E
E                int + int + int    Predict E -> T + E
T + E            int + int + int    Predict T -> int
int + E          int + int + int    Match int
+ E              + int + int        Match +
E                int + int          Predict E -> T + E
T + E            int + int          Predict T -> int
int + E          int + int          Match int
+ E              + int              Match +
E                int                Predict E -> T
T                int                Predict T -> int
int              int                Match int
                                    Accept

Notice that in each step we look at the leftmost symbol in our production. If it's a terminal, we match it, and if it's a nonterminal, we predict what it's going to be by choosing one of the rules.
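
To make this concrete, here is a minimal sketch of a hand-written predictive parser for the grammar above, using two tokens of lookahead to choose between E → T + E and E → T. The function and helper names (parse_ll2, peek, match) are ad hoc, made up for this illustration rather than taken from any tool:

def parse_ll2(tokens):
    # tokens: a list such as ['int', '+', 'int'] (already lexed)
    pos = 0

    def peek(k=0):
        return tokens[pos + k] if pos + k < len(tokens) else None

    def match(expected):              # the Match action
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {peek()!r}")
        pos += 1

    def parse_S():                    # S -> E
        parse_E()

    def parse_E():
        # Predict: since T -> int, the second lookahead token decides
        # between E -> T + E and E -> T.
        if peek(1) == '+':
            parse_T(); match('+'); parse_E()
        else:
            parse_T()

    def parse_T():                    # T -> int
        match('int')

    parse_S()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

parse_ll2(['int', '+', 'int', '+', 'int'])   # accepted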

In an LR parser, there are two actions:

  1. Shift : Add the next token of input to a buffer for consideration.
  2. Reduce : Reduce a collection of terminals and nonterminals in this buffer back to some nonterminal by reversing a production.

As an example, an LR(1) parser (with one token of lookahead) might parse that same string as follows:

Workspace        Input              Action
---------------------------------------------------------
                 int + int + int    Shift
int              + int + int        Reduce T -> int
T                + int + int        Shift
T +              int + int          Shift
T + int          + int              Reduce T -> int
T + T            + int              Shift
T + T +          int                Shift
T + T + int                         Reduce T -> int
T + T + T                           Reduce E -> T
T + T + E                           Reduce E -> T + E
T + E                               Reduce E -> T + E
E                                   Reduce S -> E
S                                   Accept
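
The following toy recognizer mimics that trace for this particular grammar. The shift/reduce decisions are hard-coded here purely for illustration; a real LR parser reads them from a generated parse table. The name parse_lr_toy is made up for this sketch:

def parse_lr_toy(tokens):
    stack, i = [], 0
    while True:
        if stack and stack[-1] == 'int':
            stack[-1] = 'T'                        # Reduce T -> int
        elif i < len(tokens):
            stack.append(tokens[i])                # Shift
            i += 1
        elif stack[-3:] == ['T', '+', 'E']:
            stack[-3:] = ['E']                     # Reduce E -> T + E
        elif stack[-1:] == ['T']:
            stack[-1] = 'E'                        # Reduce E -> T
        elif stack == ['E']:
            return True                            # Reduce S -> E and accept
        else:
            raise SyntaxError(f"cannot reduce stack {stack}")

parse_lr_toy(['int', '+', 'int', '+', 'int'])      # accepted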

The two parsing algorithms you mentioned (LL and LR) are known to have different characteristics. LL parsers tend to be easier to write by hand, but they are less powerful than LR parsers and accept a much smaller set of grammars than LR parsers do. LR parsers come in many flavors (LR(0), SLR(1), LALR(1), LR(1), IELR(1), GLR(0), etc.) and are far more powerful. They also tend to be much more complex and are almost always generated by tools like yacc or bison. LL parsers also come in many flavors (including LL(*), which is used by the ANTLR tool), though in practice LL(1) is the most widely used.

As a shameless plug, if you'd like to learn more about LL and LR parsing, I just finished teaching a compilers course and have some handouts and lecture slides on parsing on the course website. I'd be glad to elaborate on any of them if you think it would be useful.

Josh Haberman in his article LL and LR Parsing Demystified claims that LL parsing directly corresponds with Polish Notation, whereas LR corresponds to Reverse Polish Notation. The difference between PN and RPN is the order of traversing the binary tree of the equation:

[Figure: the binary tree of the expression 1 + 2 * 3]

+ 1 * 2 3  // Polish (prefix) expression; pre-order traversal.
1 2 3 * +  // Reverse Polish (postfix) expression; post-order traversal.
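
As a small, self-contained illustration of that point, here is a sketch in which the tree for 1 + 2 * 3 is written as nested (operator, left, right) tuples: a pre-order walk reproduces the Polish form and a post-order walk the Reverse Polish form.

# the tree for 1 + 2 * 3, written as nested (operator, left, right) tuples
tree = ('+', 1, ('*', 2, 3))

def pre_order(node):
    if not isinstance(node, tuple):
        return [node]
    op, left, right = node
    return [op] + pre_order(left) + pre_order(right)

def post_order(node):
    if not isinstance(node, tuple):
        return [node]
    op, left, right = node
    return post_order(left) + post_order(right) + [op]

print(pre_order(tree))    # ['+', 1, '*', 2, 3]  -- Polish notation, the LL (top-down) order
print(post_order(tree))   # [1, 2, 3, '*', '+']  -- Reverse Polish, the LR (bottom-up) order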

According to Haberman, this illustrates the main difference between LL and LR parsers:

The primary difference between how LL and LR parsers operate is that an LL parser outputs a pre-order traversal of the parse tree and an LR parser outputs a post-order traversal.

For the in-depth explanation, examples and conclusions check out Haberman's article.

LL parsing is handicapped when compared to LR. Here is a grammar that is a nightmare for an LL parser generator:

Goal           -> (FunctionDef | FunctionDecl)* <eof>                  

FunctionDef    -> TypeSpec FuncName '(' [Arg/','+] ')' '{' '}'       

FunctionDecl   -> TypeSpec FuncName '(' [Arg/','+] ')' ';'            

TypeSpec       -> int        
               -> char '*' '*'                
               -> long                 
               -> short                   

FuncName       -> IDENTIFIER                

Arg            -> TypeSpec ArgName         

ArgName        -> IDENTIFIER 

A FunctionDef looks exactly like a FunctionDecl until the ';' or '{' is encountered.

An LL parser cannot handle two rules at the same time, so it must choose either FunctionDef or FunctionDecl. But to know which is correct it has to look ahead for a ';' or '{'. At grammar analysis time, the lookahead (k) appears to be infinite. At parsing time it is finite, but could be large.
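
A quick way to see why k is unbounded: the token that settles the choice (';' or '{') only appears after the entire parameter list, so its distance from the first token grows with the number of parameters. The throwaway snippet below simply measures that distance; lookahead_needed and the hand-made token list are invented for this illustration:

def lookahead_needed(tokens):
    # number of tokens to inspect, starting at the TypeSpec, before the
    # FunctionDef / FunctionDecl decision can be made
    for k, tok in enumerate(tokens, start=1):
        if tok in (';', '{'):
            return k
    raise SyntaxError("neither ';' nor '{' found")

decl = ['int', 'main', '(', 'int', 'na', ',', 'char', '*', '*', 'arg', ')', ';']
print(lookahead_needed(decl))   # 12 -- and it grows with every extra parameter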

An LR parser does not have to look ahead, because it can handle two rules at the same time. So LALR(1) parser generators can handle this grammar with ease.

Given the input code:

int main (int na, char** arg); 

int main (int na, char** arg) 
{

}

An LR parser can parse the

int main (int na, char** arg)

without caring what rule is being recognized until it encounters a ';' or a '{'.

An LL parser gets hung up at the 'int' because it needs to know which rule is being recognized. Therefore it must look ahead for a ';' or '{'.

The other nightmare for LL parsers is left recursion in a grammar. Left recursion is a normal thing in grammars, no problem for an LR parser generator, but LL can't handle it.

So you have to write your grammars in an unnatural way with LL.
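
For example, a left-recursive rule such as E → E '+' T | T sends a naive recursive-descent (LL-style) parser into infinite recursion, because it would call itself before consuming any input; the usual "unnatural" rewrite is E → T ('+' T)*, which becomes a loop. A minimal sketch of the rewritten version, with T simplified to a single int token just for illustration (parse_E_ll is a made-up name):

def parse_E_ll(tokens):
    # E -> T ('+' T)*, with T -> int; the LL-friendly rewrite of E -> E '+' T | T
    pos = 0
    def match(t):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != t:
            raise SyntaxError(f"expected {t!r} at position {pos}")
        pos += 1
    match('int')                      # T
    while pos < len(tokens) and tokens[pos] == '+':
        match('+')
        match('int')                  # T
    return pos == len(tokens)

print(parse_E_ll(['int', '+', 'int', '+', 'int']))   # True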

LL uses a top-down approach, while LR uses a bottom-up approach.

If you parse a programming language:

  • The LL parser sees source code, which contains functions, which contain expressions.
  • The LR parser sees expressions, which belong to functions, which make up the full source.


Adding on top of the above answers, the difference between the individual parsers in the class of bottom-up parsers is whether they produce shift/reduce or reduce/reduce conflicts when generating the parsing tables. The fewer conflicts a table-construction method produces, the more powerful it is, i.e., the larger the class of grammars it accepts (LR(0) < SLR(1) < LALR(1) < CLR(1)).

For example, consider the following expression grammar:

E → E + T

E → T

T → F

T → T * F

F → ( E )

F → id

It's not LR(0) but SLR(1). Using the following code, we can construct the LR0 automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, and compute the action and goto sets):

from copy import deepcopy
import pandas as pd

def update_items(I, C):
    # merge the item sets in C into I (both map a nonterminal to its dotted rules)
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs): 
    #I0 = deepcopy(I0)
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            #if ix == len(r)-1: # reduce step
            if ix >= len(r)-1 or r[ix+1] != sym:
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1]    # read the next symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
        I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    I0 = get_start_state(G, NTs, Ts)   # I0: the initial item set (state 0)
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)):0}
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs) 
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:   # new state discovered
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute actions for terminals similarly
        # ... ... ...

    return states2items, items2states, parse_table
        
states2items, items2states, parse_table = construct_LR0_automaton(G, NTs, Ts)

where the grammar G, non-terminal and terminal symbols are defined as below:

G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]
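
The snippet above also calls get_start_state() and to_str(), which are not shown. Given how they are used (state 0 is passed to compute_closure() in place of the grammar, and to_str() is used to key item sets), they presumably look something like the following minimal sketches; these are assumptions, not the original implementation:

def get_start_state(G, NTs, Ts):
    # state 0: every production with the dot at the far left; for this grammar
    # that coincides with the closure of the augmented start item
    return {nt: [['.'] + r for r in G[nt]] for nt in G}

def to_str(I):
    # canonical, hashable representation of an item set, used as a dict key
    return sorted((nt, tuple(map(tuple, rs))) for nt, rs in I.items())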

Here are a few more useful functions I implemented along with the above ones for LR(0) parsing table generation:

def augment(G, S): # start symbol S
    # add the augmented start rule S1 -> S $
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    # closure of the dotted item r: repeatedly pull in the (dotted) rules of
    # every nonterminal that appears immediately to the right of a dot
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
                    queue.append(rr)
    return S

The following figure shows the LR0 DFA constructed for the grammar using the above code:

[Figure: the LR0 DFA constructed for the expression grammar]

The following table shows the LR0 parsing table generated as a pandas dataframe; notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0).

[Figure: the generated LR0 parsing table, showing shift/reduce conflicts]

An SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the Follow Set of the nonterminal being reduced. So the above grammar is not LR(0), but it's SLR(1).
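
As a small add-on, the Follow Sets mentioned above can be computed with a sketch like the one below. It reuses the G and NTs defined earlier for the expression grammar, and it assumes no production derives the empty string (which holds here and keeps FIRST/FOLLOW simple); compute_first and compute_follow are names made up for this sketch:

def compute_first(G, NTs):
    # FIRST(nt) for a grammar without epsilon productions: the terminals
    # that can begin a string derived from nt
    first = {nt: set() for nt in G}
    changed = True
    while changed:
        changed = False
        for nt, rules in G.items():
            for r in rules:
                x = r[0]
                add = first[x] if x in G else {x}
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

def compute_follow(G, NTs, start):
    # FOLLOW(nt): the terminals that can appear immediately after nt
    first = compute_first(G, NTs)
    follow = {nt: set() for nt in G}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for nt, rules in G.items():
            for r in rules:
                for i, x in enumerate(r):
                    if x not in G:
                        continue
                    if i + 1 < len(r):
                        nxt = r[i + 1]
                        add = first[nxt] if nxt in G else {nxt}
                    else:
                        add = follow[nt]    # x is at the end of the rule
                    if not add <= follow[x]:
                        follow[x] |= add
                        changed = True
    return follow

print(compute_follow(G, NTs, 'E'))
# e.g. FOLLOW(E) = {'$', '+', ')'} and FOLLOW(T) = FOLLOW(F) = {'$', '+', '*', ')'}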

But the following grammar, which accepts the strings of the form a^ncb^n, n >= 1, is LR(0):

A → a A b

A → c

S → A

Let's define the grammar as follows:

# S --> A 
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]

[Figure: the LR0 DFA constructed for this grammar]

As can be seen from the following figure, there is no conflict in the parsing table generated.

[Figure: the generated LR0 parsing table, with no conflicts]

Here is how the input string a^2cb^2 can be parsed using the above LR(0) parse table, with the following code:

def parse(input, parse_table, rules):
    # input: a token string ending with the end marker '$', e.g. 'aacbb$'
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':   # shift
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r': # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):      # pop the handle (symbol/state pairs)
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc':  # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        df = pd.concat([df, pd.DataFrame([df2])], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
        if accepted:
            break

    return df

parse('aacbb$', parse_table, rules)

where the grammar rules are:

S → A

A → a A b

A → c
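
Given how parse() looks up rules[int(action[-1])] and compares the popped handle with r['r'], the rules argument is presumably a dict of productions keyed by the rule numbers that appear in the 'r<k>' table entries, along the lines of the following. The exact numbering depends on the table-construction code elided above, so this is only an assumption:

rules = {1: {'l': 'S', 'r': 'A'},        # S -> A
         2: {'l': 'A', 'r': 'aAb'},      # A -> a A b
         3: {'l': 'A', 'r': 'c'}}        # A -> c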

The next animation shows how the string is parsed and accepted when the above code is run:

[Animation: step-by-step LR(0) parse of the input string]
