ANTLR4 grammar for SML choking on positive integer literals

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:

# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)

I've trimmed the grammar down as far as I can while still reproducing the issue, which seems very strange. This grammar shows it (even though LABEL isn't used by any parser rule):

grammar SML_Small;

Whitespace : [ \t\r\n]+ -> skip ;

expression : CONSTANT ;

LABEL : [1-9] NUM* ;

CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;

On the other hand, removing LABEL makes positive numbers work again:

grammar SML_Small;

Whitespace : [ \t\r\n]+ -> skip ;

expression : CONSTANT ;

CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;

I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.

I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.

As already mentioned in the comments by Rici: the lexer tries to match as many characters as possible, and when two or more rules match the same characters, the one defined first "wins". So with rules like these:

LABEL    : [1-9] NUM* ;
CONSTANT : INT ;
INT      : '~'? NUM ;
NUM      : DIGIT+ ;
DIGIT    : [0-9] ;

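the input 1 will always become a LABEL, as will any other run of digits that starts with a non-zero digit (10, 42, ...): LABEL and CONSTANT match the same characters, and LABEL is defined first. Input like 0 or ~5 will always be a CONSTANT. INT never produces a token of its own, because CONSTANT matches exactly the same input and is defined before it. NUM and DIGIT never produce tokens either, since an earlier rule always matches at least as many characters. NUM and DIGIT in particular are only helper rules, which makes them candidates for fragment rules: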

fragment NUM   : DIGIT+ ;
fragment DIGIT : [0-9] ;

That way, you can't accidentally use these tokens inside parser rules.
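
The tie-breaking described above (longest match first, then the earliest rule) can be sketched outside ANTLR. The following is only a rough Python simulation, not ANTLR's actual implementation: the regular expressions are hand-written equivalents of the five lexer rules, and next_token is a hypothetical helper name.

```python
import re

# Lexer rules in grammar order. Each pattern is a hand-written regex
# equivalent of the corresponding ANTLR rule (an assumption of this sketch).
RULES = [
    ("LABEL",    r"[1-9][0-9]*"),
    ("CONSTANT", r"~?[0-9]+"),
    ("INT",      r"~?[0-9]+"),
    ("NUM",      r"[0-9]+"),
    ("DIGIT",    r"[0-9]"),
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the token starting at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        # Strict '>' keeps the earliest rule when match lengths tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

for sample in ["1", "0", "10", "01", "~5"]:
    print(sample, "->", next_token(sample))
```

In this simulation INT, NUM and DIGIT never win for any input, because an earlier rule always matches at least as many characters.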

Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So a unary operator like ~ is often better handled in a parser rule: expression : '~' expression | ... ;

Finally, if you want to distinguish non-zero integers (the only ones that can serve as labels) from other integer literals, you can do it like this:

grammar SML_Small;


expression
 : '(' expression ')'
 | '~' expression
 | integer 
 ;

integer
 : INT
 | INT_NON_ZERO
 ;

label
 : INT_NON_ZERO
 ;

INT_NON_ZERO : [1-9] DIGIT* ;
INT          : DIGIT+ ;
SPACES       : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
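
To see how the split between INT_NON_ZERO and INT plays out, here is a small Python sketch under the same assumptions as ANTLR's lexer (longest match, earliest rule on ties). The regexes stand in for the two lexer rules, and classify, is_integer and is_label are hypothetical helpers that mimic what the integer and label parser rules accept for a single token.

```python
import re

# Revised lexer rules in grammar order (regex equivalents are assumptions).
RULES = [
    ("INT_NON_ZERO", r"[1-9][0-9]*"),
    ("INT",          r"[0-9]+"),
]

def classify(text):
    """Token type if the whole string is one token, else None."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = re.match(pattern, text)
        # Strict '>' keeps the earliest rule when match lengths tie.
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name if best_len == len(text) else None

# Token types accepted by the two parser rules.
INTEGER = {"INT", "INT_NON_ZERO"}
LABEL = {"INT_NON_ZERO"}

def is_integer(text):
    return classify(text) in INTEGER

def is_label(text):
    return classify(text) in LABEL

for sample in ["0", "7", "42", "007"]:
    print(sample, classify(sample), is_integer(sample), is_label(sample))
```

Every digit string is still an integer, but only those without a leading zero can be used where a label is expected.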
