Why does the order of ANTLR4 tokens matter?

Question

I have a simple grammar that will eventually parse YANG source. When I make what seems to be an arbitrary change to the location of the MODULE token, the IntelliJ ANTLR4 plugin either can or cannot parse my input.

The input string to be parsed:

module x { }

Here is the grammar that works without any error:

grammar Yang ;

yang: module_open module_close;

module_open : MODULE ID BRACKET_OPEN ;

module_close: BRACKET_CLOSE ;

MODULE: 'module' ;

ID: ([A-Za-z][A-Za-z0-9_-]*) ;
BRACKET_OPEN: '{' ;
BRACKET_CLOSE: '}' ;

WS: [ \t\r\n]+ -> skip ;

Here is the grammar that fails:

grammar Yang ;

yang: module_open module_close;

module_open : MODULE ID BRACKET_OPEN ;

module_close: BRACKET_CLOSE ;

ID: ([A-Za-z][A-Za-z0-9_-]*) ;

MODULE: 'module' ;

BRACKET_OPEN: '{' ;
BRACKET_CLOSE: '}' ;

WS: [ \t\r\n]+ -> skip ;

All I'm doing is cutting and pasting the MODULE token definition before or after the ID token definition, and parsing always fails when MODULE is defined after ID.

What am I missing? I see no discussion of order of tokens in the docs!

EDIT: @BartKiers Related Post... ANTLR4 lexer rules don't work as expected

Answer 1

It fails when MODULE comes after ID because the text 'module' is also a valid ID. The lexer always prefers the rule that matches the most characters. Rule order matters only when two or more lexer rules match the same text, as MODULE and ID both do for 'module'; in that case the rule that appears first in the grammar wins. With ID declared first, 'module' is tokenized as ID, so the parser rule module_open never sees the MODULE token it expects.

Your test case is an excellent illustration of this behavior.

The ANTLR4 documentation used to include a great article by Sam Harwell that explained this perfectly, but I can no longer find it.
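The precedence rule described in this answer can be imitated with a small Python sketch. This is only a toy model under stated assumptions, not how ANTLR is implemented: the `tokenize` helper is hypothetical, and the lexer rules from the question are approximated with Python regular expressions. At each position it keeps the longest match, and on a tie it keeps the rule listed first.

```python
import re

def tokenize(rules, text):
    """Toy lexer: longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: an equally long match from a later rule
            # never replaces the match of an earlier rule.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex approximations of the lexer rules in the question (an assumption).
MODULE = ("MODULE", r"module")
ID = ("ID", r"[A-Za-z][A-Za-z0-9_-]*")
REST = [("BRACKET_OPEN", r"\{"), ("BRACKET_CLOSE", r"\}"), ("WS", r"[ \t\r\n]+")]

working = tokenize([MODULE, ID] + REST, "module x { }")  # MODULE before ID
failing = tokenize([ID, MODULE] + REST, "module x { }")  # ID before MODULE

print(working[0])  # ('MODULE', 'module')
print(failing[0])  # ('ID', 'module')
```

In this model, order only breaks ties: `tokenize([MODULE, ID] + REST, "modules")` yields a single ID token even though MODULE is listed first, because ID matches one character more.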

Answer 2

From The Definitive ANTLR 4 Reference (section 5.5):

Matching Identifiers

In grammar pseudocode, a basic identifier is a nonempty sequence of uppercase and lowercase letters. Using our newfound skills, we know to express the sequence pattern using notation (...)+. Because the elements of the sequence can be either uppercase or lowercase letters, we also know that we'll have a choice operator inside the subrule.

ID : ('a'..'z'|'A'..'Z')+ ; // match 1-or-more upper or lowercase letters

The only new ANTLR notation here is the range operator: 'a'..'z' means any character from a to z. That is literally the ASCII code range from 97 to 122. To use Unicode code points, we need to use '\uXXXX' literals where XXXX is the hexadecimal value for the Unicode character code point value.

As a shorthand for character sets, ANTLR supports the more familiar regular expression set notation.

ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters

Rules such as ID sometimes conflict with other lexical rules or literals referenced in the grammar such as 'enum'.

grammar KeywordTest;
enumDef : 'enum' '{' ... '}' ;
...
FOR : 'for' ;
...
ID : [a-zA-Z]+ ; // does NOT match 'enum' or 'for'

Rule ID could also match keywords such as enum and for, which means there's more than one rule that could match the same string. To make this clearer, consider how ANTLR handles combined lexer/parser grammars such as this. ANTLR collects and separates all of the string literals and lexer rules from the parser rules. Literals such as 'enum' become lexical rules and go immediately after the parser rules but before the explicit lexical rules.

ANTLR lexers resolve ambiguities between lexical rules by favoring the rule specified first. That means your ID rule should be defined after all of your keyword rules, like it is here relative to FOR. ANTLR puts the implicitly generated lexical rules for literals before explicit lexer rules, so those always have priority. In this case, 'enum' is given priority over ID automatically. Because ANTLR reorders the lexical rules to occur after the parser rules, the following variation on KeywordTest results in the same parser and lexer:

grammar KeywordTestReordered;
FOR : 'for' ;
ID : [a-zA-Z]+ ; // does NOT match 'enum' or 'for'
...
enumDef : 'enum' '{' ... '}' ;
...
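(Not from the book.) The collection step described in the quote can be imitated in a few lines of Python. This is a toy model under stated assumptions, not ANTLR's code: `effective_rule_order` and `classify` are hypothetical helper names, the `T__n` names are modeled on the names ANTLR gives implicit tokens, and each literal is turned into a regular expression with `re.escape`.

```python
import re

def effective_rule_order(parser_literals, explicit_rules):
    """Implicit rules for parser literals first, then explicit lexer rules in file order."""
    implicit = [(f"T__{i}", re.escape(lit)) for i, lit in enumerate(parser_literals)]
    return implicit + list(explicit_rules)

def classify(lexeme, rules):
    """Return the name of the first rule that matches the whole lexeme, or None."""
    for name, pattern in rules:
        if re.fullmatch(pattern, lexeme):
            return name
    return None

# KeywordTest from the quote: 'enum', '{' and '}' are parser literals;
# FOR and ID are explicit lexer rules, with FOR declared before ID.
rules = effective_rule_order(
    ["enum", "{", "}"],
    [("FOR", r"for"), ("ID", r"[a-zA-Z]+")],
)

print(classify("enum", rules))  # T__0
print(classify("for", rules))   # FOR
print(classify("foo", rules))   # ID
```

Because the implicit rules are placed first, 'enum' never reaches ID in this model, no matter where ID sits among the explicit rules. FOR gets no such help: if ID were listed before FOR, `classify("for", ...)` would return ID, which is the situation in the question.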

solution1
3 2017-08-23 15:09:04

solution2
1 2020-01-28 07:57:15