
Using tokens inside other tokens in ANTLR4

I ran into a problem with ANTLR and I wonder whether a situation like this is even supported. I have prepared a very simplified example below.

grammar test;

test
    : statement*
    ;

statement
    : s1
    | s2
    ;

s1
    : 'OK' INT
    ;

s2
    : 'ABC' US_INT
    ;

INT
    : S_INT
    | US_INT
    ;

S_INT
   : [+-] [0-9]+
   ;

US_INT
    : [0-9]+
    ;

For OK 5 everything is ok, but for ABC 5 I get the following error:

line 1:4 mismatched input '5' expecting US_INT

Running grun with the -tokens option shows that the token is typed as INT instead of US_INT:

[@1,4:4='5',<INT>,1:4]

This made me wonder whether such a setup is possible in ANTLR at all. I have already tried reordering the tokens, moving US_INT out of INT, using fragments, and a few other things, but none of it worked well. The only effect was that OK 5 stopped working and ABC 5 started working. I would like both cases to parse without errors.

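The problem you're facing is quite simple: 5 can match both INT (since INT contains US_INT) and US_INT itself. When two lexer rules match the same longest input, ANTLR picks the rule that is defined first. Since INT is declared before US_INT, the lexer always resolves 5 as INT, and the parser never sees a US_INT token.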
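The lexer's choice can be reproduced with a minimal Python sketch. This is not ANTLR itself; it only mimics the two rules ANTLR's lexer applies (the longest match wins, and on a tie the rule defined first wins). The names `RULES` and `tokenize` are hypothetical, and skipping whitespace is an assumption, since the simplified grammar has no whitespace rule.

```python
import re

# Lexer rules in the order of the original grammar. The keyword literals are
# listed first; their position does not affect how the integer is typed.
RULES = [
    ("'OK'", r"OK"),
    ("'ABC'", r"ABC"),
    ("INT", r"[+-][0-9]+|[0-9]+"),   # INT : S_INT | US_INT
    ("S_INT", r"[+-][0-9]+"),
    ("US_INT", r"[0-9]+"),
]

def tokenize(text, rules=RULES):
    """Return (type, lexeme) pairs: longest match wins, first rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():       # assumption: whitespace is skipped
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: an earlier rule keeps the token on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("ABC 5"))   # the '5' is typed INT, never US_INT
```

Because INT and US_INT both match the single character 5, the tie goes to INT, which is exactly what the -tokens output in the question shows.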

To solve it, I'd suggest moving INT out of the lexer rules and turning it into a parser rule, like this:

grammar test;

test
    : statement*
    ;

statement
    : s1
    | s2
    ;

s1
    : 'OK' int_stmt
    ;

s2
    : 'ABC' US_INT
    ;
    
int_stmt
    : S_INT | US_INT
    ;

S_INT
   : [+-] [0-9]+
   ;

US_INT
    : [0-9]+
    ;
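To see why this revision accepts both inputs, here is a minimal Python sketch of the revised rules. It is a simulation under stated assumptions, not ANTLR output: `LEXER_RULES`, `lex`, and `accepts` are hypothetical names, whitespace is skipped, and the parser rules are hand-coded as a simple loop.

```python
import re

# Token rules of the revised grammar: INT no longer exists as a lexer rule.
LEXER_RULES = [("S_INT", r"[+-][0-9]+"), ("US_INT", r"[0-9]+"),
               ("OK", r"OK"), ("ABC", r"ABC")]

def lex(text):
    """Longest match wins; on a tie the earlier rule wins. Whitespace is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        matches = [(name, re.compile(p).match(text, pos)) for name, p in LEXER_RULES]
        matches = [(name, m.group()) for name, m in matches if m]
        if not matches:
            raise ValueError(f"no rule matches at offset {pos}")
        # max() returns the first maximal item, which preserves rule order on ties.
        name, lexeme = max(matches, key=lambda item: len(item[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

def accepts(text):
    """test : statement* ; statement : 'OK' int_stmt | 'ABC' US_INT ;"""
    types = [name for name, _ in lex(text)]
    i = 0
    while i < len(types):
        if i + 1 >= len(types):
            return False
        keyword, number = types[i], types[i + 1]
        if keyword == "OK" and number in ("S_INT", "US_INT"):   # int_stmt
            i += 2
        elif keyword == "ABC" and number == "US_INT":
            i += 2
        else:
            return False
    return True

print(accepts("OK 5"), accepts("OK -5"), accepts("ABC 5"), accepts("ABC -5"))
```

With INT gone from the lexer, 5 is always typed US_INT, and the choice between signed and unsigned is made by the parser rule int_stmt, where it no longer competes with other tokens.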

If you would rather avoid lexer priorities altogether in this case, you can use the following ABNF parser grammar in Tunnel Grammar Studio, which does not have this issue:

test         = *statement
statement    = s-ok / s-abc
s-ok         = "OK" 1*ws int
s-abc        = "ABC" 1*ws unsigned-int
int          = signed-int / unsigned-int
signed-int   = ('+' / '-') unsigned-int 
unsigned-int = 1*('0'-'9')
ws           = %x20 / %x9 / %xA / %xD

Quoted strings in this grammar match case-insensitively, as defined by ABNF (RFC 5234). You can also request case-sensitive or case-insensitive matching per string with %s"ABC" or %i"ABC" respectively (RFC 7405). As you add more statements, some strings will start to overlap; at that point you can define keywords in the lexer grammar:

keyword      = %s"OK" / %s"OK2"

and in the parser grammar do:

s-ok         = {keyword, %s"OK"} 1*ws int 
s-ok-2       = {keyword, %s"OK2"} 1*ws int 1*ws int 
s-ok-any     = {keyword} 1*ws int *(ws 0*1 int)

Note that the last rule allows any amount of white space between the integers, and it matches any keyword.

*I develop Tunnel Grammar Studio. The grammar is quite simple, so the demo may be enough for you.
