ANTLR grammar: Allow whitespace matching only in template string

I want to parse template strings:

`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`

Here is my grammar:

varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)*  ')' ;

WS      : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;

When the input is parsed with this grammar, the template string no longer contains any whitespace because of WS -> skip. When I put TemplateStringLiteral before WS, I get the error:

extraneous input ' ' expecting {'`'}

How can I have whitespace parsed rather than skipped, but only inside the template string?

What is currently happening

When I run your example through your current grammar and print the generated tokens, the lexer gives this:

[@0,0:0='`',<'`'>,1:0]
[@1,1:4='Some',<VAR>,1:1]
[@2,6:9='text',<VAR>,1:6]
[@3,11:12='${',<'${'>,1:11]
[@4,13:20='variable',<VAR>,1:13]
[@5,21:21='.',<'.'>,1:21]
[@6,22:25='name',<VAR>,1:22]
[@7,26:26='}',<'}'>,1:26]
... shortened ...
[@26,85:84='<EOF>',<EOF>,2:0]

This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?

As mentioned in this answer, ANTLR uses the longest possible match to create a token. Your TemplateStringLiteral rule matches only a single character, whereas your VAR rule matches arbitrarily long runs of characters, so the lexer uses the latter to match Some. When two rules match the same length, the one defined first wins, which is why WS, listed before TemplateStringLiteral, swallows the spaces.

What you could try (Spoiler: won't work)

You could try to modify the rule like this:

TemplateStringLiteral: ('\\`' | ~'`')+ ;

so that it captures more than one character and is therefore preferred. This fails for two reasons:

  1. The lexer would never produce a VAR token again: wherever VAR matches, this rule matches at least as many characters, and it is defined first.

  2. The TemplateStringLiteral rule now also matches ${, which prevents the lexer from recognizing the start of a template chunk.

How to achieve what you actually want

There might be another solution, but this one works:

File MartinCup.g4:

parser grammar MartinCup;

options { tokenVocab=MartinCupLexer; }

templateString
    : BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
    ;

template
    : TemplateStart variable TemplateEnd
    ;

variable
    : varname funParameter? (Dot variable)*
    ;

varname
    : VAR
    ;

funParameter
    : OpenPar variable? (Comma variable)* ClosedPar
    ;

File MartinCupLexer.g4:

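lexer grammar MartinCupLexer;

// Default mode: plain text between the backticks.
BackTick : '`' ;

// '${' is two characters long, so it wins over the one-character
// TemplateStringLiteral and switches the lexer to templateMode.
TemplateStart
    : '${' -> pushMode(templateMode)
    ;

// One token per character; a backslash followed by a backtick
// is treated as an escaped backtick.
TemplateStringLiteral
    : '\\`'
    | ~'`'
    ;

// The rules below are active only between '${' and '}'.
mode templateMode;

VAR
    : [$]?[a-zA-Z0-9_]+
    | [$]
    ;

OpenPar : '(' ;
ClosedPar : ')' ;
Comma : ',' ;
Dot : '.' ;

// '}' returns the lexer to the default mode.
TemplateEnd
    : '}' -> popMode
    ;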

This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and stays active only until } is read. It therefore no longer captures non-template text like Some.

Note that lexer modes require a split grammar (separate files for the parser and lexer grammars), because modes are only allowed in a lexer grammar. Since a parser grammar cannot define tokens itself, not even implicitly through string literals, I had to introduce named tokens for the parentheses, comma, dot and backtick.
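As far as I know, once the lexer grammar defines these tokens, the parser grammar may still refer to them by their literal text, because the literals are imported through tokenVocab. A minimal sketch of two rules from MartinCup.g4 written that way, assuming the lexer rules Dot, OpenPar, Comma and ClosedPar stay exactly as defined above:

```antlr
// Alternative spelling of two rules in MartinCup.g4.
// Each literal resolves to the lexer token with the same text:
// '.' -> Dot, '(' -> OpenPar, ',' -> Comma, ')' -> ClosedPar.
variable
    : varname funParameter? ('.' variable)*
    ;

funParameter
    : '(' variable? (',' variable)* ')'
    ;
```

Whether to use names or literals is a matter of taste; a literal that has no matching rule in the lexer grammar is rejected when the parser is generated.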

About the whitespaces

I assume you want to keep whitespace inside the "normal text", but not allow whitespace inside the templates. Therefore I simply removed the WS rule. You can always re-add it if you like.
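If whitespace inside ${ ... } should be tolerated after all, one option is a skip rule that lives only in templateMode. A minimal sketch, assuming the rule name TemplateWS (my own choice, not part of the original grammar) and reusing the character set of the original WS rule:

```antlr
// Place this rule in MartinCupLexer.g4, below the "mode templateMode;" line,
// so that it is active only between '${' and '}'.
// TemplateWS is a hypothetical name chosen for this sketch.
TemplateWS
    : [ \t\r\n\u000C]+ -> skip
    ;
```

Because the rule belongs to templateMode, whitespace in the surrounding text is still emitted as TemplateStringLiteral tokens, while something like ${ variable.name } would lex without spaces reaching the parser.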

I also tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:

line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}

The reason is the same as above: Some is lexed as VAR.
