
YACC grammar for arithmetic expressions, with no surrounding parentheses

I want to write the rules for arithmetic expressions in YACC, where the following operations are defined:

+   -   *   /   ()

But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.

How can I achieve this?


The motive:

In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.

Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:

stmt:      ID '=' expr ';'
expr:      term | expr '-' term | expr '+' term
term:      factor | term '*' factor | term '/' factor
factor:    ID | NUMBER | '(' expr ')' | '-' factor

It's easy to define "set" literals:

set:       '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr

If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":

value:     expr | set

and modify the syntax for assignment statements to use that:

stmt:      ID '=' value ';'

But that leads to the reduce/reduce conflict you mention, because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')', or a set, through the expansion set → '(' expr_list ')' → '(' expr ')'.

Here are three solutions to this ambiguity:

1. Explicitly remove the ambiguity

Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:

paren:     '(' expr ')'

and then for each subexpression type X, we add a production pp_X:

pp_term:   term | paren

and modify the existing production by allowing possibly parenthesized subexpressions as operands:

term:      factor | pp_term '*' pp_factor | pp_term '/' pp_factor

Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:

a = ( 5 )

having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operator). This is not an ambiguity (the parse could be trivially resolved with an LR(2) parse table), but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have at least two expressions, and adding paren to the productions for set:

set:       '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr

Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces ( 5 ) to paren and waits for the next token to clarify the parse.

So that ends up with:

stmt:      ID '=' value ';'
value:     expr | set

set:       '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr

paren:     '(' expr ')'
pp_expr:   expr | paren
expr:      term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term:   term | paren
term:      factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor:    ID | NUMBER | '-' pp_factor

which has no conflicts.
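The decision that the disambiguated grammar defers until after the closing parenthesis can be sketched by hand. The following is a minimal recursive-descent illustration in Python, not the LR parser that yacc would generate. The names `tokenize`, `Parser` and `parse_stmt` are invented for this sketch. It models only the set-versus-expression decision; unlike the grammar above, it does not reject redundant nested parentheses such as ((5)).

```python
import re

OPERATORS = ("+", "-", "*", "/")

def tokenize(text):
    # Numbers, identifiers and single-character punctuation; whitespace is skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/(),;=]", text)

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def value(self):
        # value: expr | set
        if self.peek() != "(":
            return ("expr", self.expr())
        self.take("(")
        if self.peek() == ")":
            self.take(")")
            return ("set", [])
        items = [self.expr()]
        while self.peek() == ",":
            self.take(",")
            items.append(self.expr())
        self.take(")")
        if len(items) > 1:
            return ("set", items)
        # A single parenthesized expression: the token after ')' decides.
        if self.peek() in OPERATORS:
            return ("expr", self.expr(first=items[0]))
        return ("set", items)

    def expr(self, first=None):
        left = self.term(first)
        while self.peek() in ("+", "-"):
            op = self.take()
            left = (op, left, self.term())
        return left

    def term(self, first=None):
        # 'first' is an operand that has already been parsed (the paren case).
        left = self.factor() if first is None else first
        while self.peek() in ("*", "/"):
            op = self.take()
            left = (op, left, self.factor())
        return left

    def factor(self):
        tok = self.take()
        if tok == "-":
            return ("neg", self.factor())
        if tok == "(":
            inner = self.expr()
            self.take(")")
            return inner
        if tok[0].isalnum() or tok[0] == "_":
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

def parse_stmt(text):
    # stmt: ID '=' value ';'   (the ID itself is not validated in this sketch)
    p = Parser(tokenize(text))
    name = p.take()
    p.take("=")
    kind, tree = p.value()
    p.take(";")
    if p.peek() is not None:
        raise SyntaxError(f"trailing input at {p.peek()!r}")
    return name, kind, tree
```

With this sketch, `parse_stmt("a = (5);")` classifies the value as a set, while `parse_stmt("a = (5) * 2;")` classifies it as an expression, which is the same one-token-late decision the conflict-free grammar makes when it reduces ( 5 ) to paren.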

2. Use a GLR parser

Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.

Bison can generate GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:

%glr-parser
%%
stmt:      ID '=' value ';'
value:     expr    %dprec 1
     |     set     %dprec 2
expr:      term | expr '-' term | expr '+' term
term:      factor | term '*' factor | term '/' factor
factor:    ID | NUMBER | '(' expr ')' | '-' factor
set:       '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr

The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
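To make the effect of the two priorities concrete, here is a rough Python sketch. It is not Bison's GLR algorithm: it is a brute-force recognizer for the original ambiguous grammar that collects every reading of a value and then keeps the one with the highest priority. The helper names (`chain`, `readings`, `preferred`) and the `DPREC` table are assumptions made for this illustration only.

```python
import re

DPREC = {"expr": 1, "set": 2}  # mirrors the %dprec values in the grammar above

def tokenize(text):
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/(),]", text)

def chain(toks, i, operand, ops):
    """Every end position of `operand (op operand)*` starting at index i."""
    ends, frontier = set(), operand(toks, i)
    while frontier:
        ends |= frontier
        frontier = {k for j in frontier
                    if j < len(toks) and toks[j] in ops
                    for k in operand(toks, j + 1)} - ends
    return ends

def expr(toks, i):
    # expr: term | expr '-' term | expr '+' term
    return chain(toks, i, term, ("+", "-"))

def term(toks, i):
    # term: factor | term '*' factor | term '/' factor
    return chain(toks, i, factor, ("*", "/"))

def factor(toks, i):
    # factor: ID | NUMBER | '(' expr ')' | '-' factor
    if i >= len(toks):
        return set()
    tok = toks[i]
    if tok == "-":
        return factor(toks, i + 1)
    if tok == "(":
        return {j + 1 for j in expr(toks, i + 1)
                if j < len(toks) and toks[j] == ")"}
    if tok[0].isalnum() or tok[0] == "_":
        return {i + 1}
    return set()

def set_literal(toks, i):
    # set: '(' ')' | '(' expr_list ')'
    if i >= len(toks) or toks[i] != "(":
        return set()
    if i + 1 < len(toks) and toks[i + 1] == ")":
        return {i + 2}
    list_ends = chain(toks, i + 1, expr, (",",))
    return {j + 1 for j in list_ends if j < len(toks) and toks[j] == ")"}

def readings(text):
    """Names of the value productions that match the whole input."""
    toks = tokenize(text)
    return [name for name, rule in (("expr", expr), ("set", set_literal))
            if len(toks) in rule(toks, 0)]

def preferred(text):
    """Pick the reading with the highest priority, as %dprec would."""
    found = readings(text)
    return max(found, key=DPREC.get) if found else None
```

Here `readings("(5)")` reports both `expr` and `set`, and `preferred("(5)")` picks `set`. For `(5)*2` or `(5,6)` only one reading exists, so the priorities play no part.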

3. Fix the language

While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change

a = ( some complicated expression ) * 2

to

a = ( some complicated expression )

and suddenly a becomes a set instead of a scalar.

It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse".)

Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).

Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
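These rules are easy to check directly. The short script below uses the standard library's `ast.literal_eval` to evaluate each literal from the two paragraphs above and report the resulting type; `classify` is just a helper name chosen for this example.

```python
import ast

def classify(source):
    """Return the type name of the value that a Python literal evaluates to."""
    return type(ast.literal_eval(source)).__name__

literals = ["(5)", "()", "(5,)", "(5,6)", "(5,6,)",
            "[]", "[5]", "[5,]", "[5,6]", "[5,6,]"]

for src in literals:
    print(f"{src:8} -> {classify(src)}")
```

Only (5) comes out as an int; every other parenthesized form is a tuple and every bracketed form is a list.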
