What is the difference between LL and LR parsing?
Can anyone give me a simple example of LL parsing versus LR parsing?
At a high level, the difference between LL parsing and LR parsing is that LL parsers begin at the start symbol and try to apply productions to arrive at the target string, whereas LR parsers begin at the target string and try to arrive back at the start symbol.
An LL parse is a left-to-right, leftmost derivation. That is, we consider the input symbols from left to right and attempt to construct a leftmost derivation. This is done by beginning at the start symbol and repeatedly expanding out the leftmost nonterminal until we arrive at the target string.

An LR parse is a left-to-right, rightmost derivation, meaning that we scan from left to right and attempt to construct a rightmost derivation. The parser continuously picks a substring of the input and attempts to reverse it back to a nonterminal.
During an LL parse, the parser continuously chooses between two actions:

Predict: based on the leftmost nonterminal and some number of lookahead tokens, choose which production to apply to get closer to the input string.
Match: match the leftmost predicted terminal symbol against the leftmost unconsumed symbol of the input.
As an example, given this grammar:

S -> E
E -> T + E
E -> T
T -> int

Then given the string int + int + int, an LL(2) parser (which uses two tokens of lookahead) would parse the string as follows:
Production       Input              Action
---------------------------------------------------------
S                int + int + int    Predict S -> E
E                int + int + int    Predict E -> T + E
T + E            int + int + int    Predict T -> int
int + E          int + int + int    Match int
+ E              + int + int        Match +
E                int + int          Predict E -> T + E
T + E            int + int          Predict T -> int
int + E          int + int          Match int
+ E              + int              Match +
E                int                Predict E -> T
T                int                Predict T -> int
int              int                Match int
                                    Accept
Notice that in each step we look at the leftmost symbol in our production. If it's a terminal, we match it, and if it's a nonterminal, we predict what it's going to be by choosing one of the rules.
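To make the predict/match loop concrete, here is a small hand-written recursive-descent sketch of this grammar (the function names are mine, not part of the answer above). Note that by parsing the shared T prefix of both E-alternatives first and only then checking for '+', this hand-written form needs just one token of lookahead:

```python
def parse_ll(tokens):
    # a hand-written predictive parser for:
    #   S -> E ; E -> T + E | T ; T -> int
    pos = 0

    def match(expected):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != expected:
            raise SyntaxError('expected ' + expected)
        pos += 1

    def parse_T():
        match('int')                   # T -> int

    def parse_E():
        parse_T()                      # both alternatives of E start with T
        if pos < len(tokens) and tokens[pos] == '+':
            match('+')                 # predict E -> T + E on lookahead '+'
            parse_E()
        # otherwise predict E -> T

    parse_E()                          # S -> E
    return pos == len(tokens)          # accept only if all input is consumed

print(parse_ll(['int', '+', 'int', '+', 'int']))  # True
```

Each call to a `parse_X` function corresponds to a Predict step in the trace above, and each `match` corresponds to a Match step.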
In an LR parser, there are two actions:

Shift: add the next token of input to a buffer for consideration.
Reduce: reduce a collection of terminals and nonterminals in this buffer back to some nonterminal by reversing a production.
As an example, an LR(1) parser (with one token of lookahead) might parse that same string as follows:
Workspace        Input              Action
---------------------------------------------------------
                 int + int + int    Shift
int              + int + int        Reduce T -> int
T                + int + int        Shift
T +              int + int          Shift
T + int          + int              Reduce T -> int
T + T            + int              Shift
T + T +          int                Shift
T + T + int                         Reduce T -> int
T + T + T                           Reduce E -> T
T + T + E                           Reduce E -> T + E
T + E                               Reduce E -> T + E
E                                   Reduce S -> E
S                                   Accept
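The shift/reduce steps in that trace can be sketched by hand for this one grammar (this is an ad-hoc simplification of mine, driven by the shape of the trace rather than by generated LR tables):

```python
def parse_lr(tokens):
    # an ad-hoc shift/reduce loop specialised to
    #   S -> E ; E -> T + E | T ; T -> int
    stack = []
    for tok in tokens:
        stack.append(tok)              # Shift
        if stack[-1] == 'int':
            stack[-1] = 'T'            # Reduce T -> int
    # input exhausted: reduce from the right end, as in the trace above
    if stack[-1:] == ['T']:
        stack[-1] = 'E'                # Reduce E -> T
    while stack[-3:] == ['T', '+', 'E']:
        stack[-3:] = ['E']             # Reduce E -> T + E
    if stack == ['E']:
        stack = ['S']                  # Reduce S -> E
    return stack == ['S']              # Accept

print(parse_lr(['int', '+', 'int', '+', 'int']))  # True
```

A real LR parser decides when to shift and when to reduce by consulting a table indexed by parser state and lookahead token; here the decisions are hard-coded for this single grammar.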
The two parsing algorithms you mentioned (LL and LR) are known to have different characteristics. LL parsers tend to be easier to write by hand, but they are less powerful than LR parsers and accept a much smaller set of grammars than LR parsers do. LR parsers come in many flavors (LR(0), SLR(1), LALR(1), LR(1), IELR(1), GLR(0), etc.) and are far more powerful. They also tend to be much more complex and are almost always generated by tools like yacc or bison. LL parsers also come in many flavors (including LL(*), which is used by the ANTLR tool), though in practice LL(1) is the most widely used.
As a shameless plug, if you'd like to learn more about LL and LR parsing, I just finished teaching a compilers course and have some handouts and lecture slides on parsing on the course website. I'd be glad to elaborate on any of them if you think it would be useful.
Josh Haberman, in his article LL and LR Parsing Demystified, claims that LL parsing directly corresponds to Polish notation, whereas LR corresponds to reverse Polish notation. The difference between PN and RPN is the order in which they traverse the binary tree of the expression:
+ 1 * 2 3 // Polish (prefix) expression; pre-order traversal.
1 2 3 * + // Reverse Polish (postfix) expression; post-order traversal.
According to Haberman, this illustrates the main difference between LL and LR parsers:
The primary difference between how LL and LR parsers operate is that an LL parser outputs a pre-order traversal of the parse tree and an LR parser outputs a post-order traversal.
For the in-depth explanation, examples, and conclusions, check out Haberman's article.
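Haberman's correspondence is easy to see by traversing the same expression tree in the two orders (a small sketch of mine; the tuple encoding of the tree is an assumption):

```python
def pre_order(node):
    # node is either a leaf string or a tuple (operator, left, right)
    if isinstance(node, tuple):
        op, left, right = node
        return [op] + pre_order(left) + pre_order(right)
    return [node]

def post_order(node):
    if isinstance(node, tuple):
        op, left, right = node
        return post_order(left) + post_order(right) + [op]
    return [node]

tree = ('+', '1', ('*', '2', '3'))     # the tree for 1 + 2 * 3
print(' '.join(pre_order(tree)))       # + 1 * 2 3  (Polish; LL-style output)
print(' '.join(post_order(tree)))      # 1 2 3 * +  (Reverse Polish; LR-style output)
```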
LL parsing is handicapped when compared to LR. Here is a grammar that is a nightmare for an LL parser generator:
Goal -> (FunctionDef | FunctionDecl)* <eof>
FunctionDef -> TypeSpec FuncName '(' [Arg/','+] ')' '{' '}'
FunctionDecl -> TypeSpec FuncName '(' [Arg/','+] ')' ';'
TypeSpec -> int
-> char '*' '*'
-> long
-> short
FuncName -> IDENTIFIER
Arg -> TypeSpec ArgName
ArgName -> IDENTIFIER
A FunctionDef looks exactly like a FunctionDecl until the ';' or '{' is encountered.
An LL parser cannot handle two rules at the same time, so it must choose either FunctionDef or FunctionDecl. But to know which is correct, it has to look ahead for a ';' or '{'. At grammar analysis time, the lookahead (k) appears to be infinite. At parsing time it is finite, but could be large.
An LR parser does not have to look ahead, because it can handle two rules at the same time. So LALR(1) parser generators can handle this grammar with ease.
Given the input code:
int main (int na, char** arg);
int main (int na, char** arg)
{
}
An LR parser can parse the

int main (int na, char** arg)

without caring which rule is being recognized until it encounters a ';' or a '{'.
An LL parser gets hung up at the 'int' because it needs to know which rule is being recognized. Therefore it must look ahead for a ';' or '{'.
The other nightmare for LL parsers is left recursion in a grammar. Left recursion is a normal thing in grammars and is no problem for an LR parser generator, but LL can't handle it. So you have to write your grammars in an unnatural way for LL.
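The standard workaround is left-recursion elimination: a rule A → A a | b is rewritten as A → b A' with A' → a A' | ε. A minimal sketch for the immediate case (the function and the 'eps' marker are my own conventions, not from the answer above):

```python
def eliminate_left_recursion(nt, productions):
    # A -> A a | b   becomes   A -> b A' ;  A' -> a A' | eps
    recursive = [p[1:] for p in productions if p[0] == nt]  # the "a" parts
    base = [p for p in productions if p[0] != nt]           # the "b" parts
    if not recursive:
        return {nt: productions}
    new_nt = nt + "'"
    return {
        nt: [b + [new_nt] for b in base],
        new_nt: [r + [new_nt] for r in recursive] + [['eps']],
    }

# E -> E + T | T   becomes   E -> T E' ;  E' -> + T E' | eps
print(eliminate_left_recursion('E', [['E', '+', 'T'], ['T']]))
```

This is exactly the "unnatural" rewriting LL forces on you: the left-associative structure of the original rule is lost and has to be rebuilt after parsing.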
LL uses a top-down approach, while LR uses a bottom-up approach.
Adding on top of the above answers, the difference between the individual parsers in the class of bottom-up parsers is whether they result in shift/reduce or reduce/reduce conflicts when generating the parsing tables. The fewer conflicts a parser type produces, the larger the class of grammars it can handle (LR(0) < SLR(1) < LALR(1) < CLR(1)).
For example, consider the following expression grammar:

E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id
It's not LR(0), but it is SLR(1). Using the following code, we can construct the LR(0) automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, and compute the action and goto sets):
from copy import deepcopy
import pandas as pd

def update_items(I, C):
    # merge the item set C into the item set I
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs):
    # compute the item set reached from the item set I on the grammar symbol sym
    I1 = {}
    for NT in I:
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            if ix >= len(r)-1 or r[ix+1] != sym:  # dot is not in front of sym
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1]  # move the dot past the symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
            I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    I0 = get_start_state(G, NTs, Ts)  # helper not shown in this post
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)): 0}  # to_str(): helper not shown here
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute the goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs)
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute actions for terminals similarly
        # ... ... ...
    return states2items, items2states, parse_table

states2items, items2states, parse_table = construct_LR0_automaton(G, NTs, Ts)
where the grammar G and the non-terminal and terminal symbols are defined as below:
G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]
Here are a few more useful functions I implemented, along with the ones above, for LR(0) parsing table generation:
def augment(G, NTs, S):  # S: the start symbol
    # add a new start production S1 -> S $ to the grammar
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    # closure of the dotted item r: for every non-terminal right after
    # the dot, add all of its productions as fresh items (dot in front)
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = [['.'] + rr for rr in G[r[ix]]]
            for rr in G[r[ix]]:
                if not ['.'] + rr in seen:
                    queue.append(['.'] + rr)
    return S
The following figure (expand it to view) shows the LR(0) DFA constructed for the grammar using the above code:
The following table shows the LR(0) parsing table generated as a pandas DataFrame. Notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0).
An SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the Follow set of the nonterminal being reduced. So the above grammar is not LR(0), but it is SLR(1).
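The Follow sets behind that decision can be computed with a small fixed-point iteration (a self-contained helper of mine, independent of the code above; treating E as the start symbol with '$' in its Follow set is an assumption, since the grammar is not augmented here). For example, the conflicting state reduces E → T only on {'+', ')', '$'}:

```python
# the same expression grammar as above, redefined here for self-containment
G = {'E': [['E', '+', 'T'], ['T']],
     'T': [['T', '*', 'F'], ['F']],
     'F': [['(', 'E', ')'], ['id']]}

def first(sym, G):
    # FIRST set of a single symbol; a terminal is its own FIRST set
    if sym not in G:
        return {sym}
    out = set()
    for prod in G[sym]:
        if prod[0] != sym:  # skip left-recursive alternatives (no eps rules here)
            out |= first(prod[0], G)
    return out

def follow_sets(G, start='E'):
    follow = {nt: set() for nt in G}
    follow[start].add('$')
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for lhs, prods in G.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym in G:
                        # FOLLOW(sym) gets FIRST of what follows it,
                        # or FOLLOW(lhs) if sym is rightmost in the production
                        new = first(prod[i+1], G) if i+1 < len(prod) else follow[lhs]
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    return follow

print(follow_sets(G))  # FOLLOW(E) = {'+', ')', '$'}, etc.
```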
But the following grammar, which accepts the strings of the form a^n c b^n, n >= 1, is LR(0):

A → a A b
A → c
S → A
Let's define the grammar as follows:
# S --> A
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]
As can be seen from the following figure, there is no conflict in the generated parsing table.
Here is how the input string a^2cb^2 can be parsed using the above LR(0) parse table, with the following code:
def parse(input, parse_table, rules):
    # parse_table: a pandas DataFrame with a `states` column and one column
    # per grammar symbol; rules: numbered productions, each assumed to be a
    # dict {'l': lhs-symbol, 'r': rhs-string}
    stack = [0]  # the stack holds alternating states and grammar symbols
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':    # shift
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r':  # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):  # pop the handle (symbols and states)
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc':   # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        # DataFrame.append() was removed in pandas 2.0; use pd.concat instead
        df = pd.concat([df, pd.DataFrame([df2])], ignore_index=True)
        if accepted:
            break
    return df

parse('aacbb$', parse_table, rules)
where the grammar rules are:

S → A
A → a A b
A → c
The next animation shows how the string is parsed and accepted when the above code is run: