
How to parse grammar of XSD Regex with ANTLR4?

Dear Antlr4 community,

I recently started using ANTLR4 to translate regular expressions from XSD/XML to CVC4. I use the grammar specified by the W3C (see http://www.w3.org/TR/xmlschema11-2/#regexs). For this question I have simplified that grammar (by removing charClass) to:

grammar XSDRegExp;

regExp            :     branch ( '|' branch )* ;
branch            :     piece* ;
piece             :     atom quantifier? ;
quantifier        :     Quantifiers | '{'quantity'}' ;
quantity          :     quantRange | quantMin | QuantExact ;
quantRange        :     QuantExact ',' QuantExact ;
quantMin          :     QuantExact ',' ;
atom              :     NormalChar | '(' regExp ')' ;       // excluded | charClass  ;

QuantExact        :     [0-9]+ ;
NormalChar        :     ~[.\\?*+{}()|\[\]] ;        
Quantifiers       :     [?*+] ;     

Parsing works for this input:

input    a(bd){6,7}c{14,15}

However, I get an error message for:

input    12{3,4}

The error is:

line 1:0 mismatched input '12' expecting {<EOF>, '(', '|', NormalChar}

I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.

I tried a number of changes:

[1] Swapping the definitions of QuantExact and NormalChar. But swapping introduces an error in the first input:

line 1:6 no viable alternative at input '6'

since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.

[2] Trying to create a context for QuantExact (the curly brackets of quantity), so that the lexer only produces QuantExact tokens inside that limited context. But I could not find ANTLR4 primitives for this.

Nothing seems to work, so my question is: can I parse this grammar with ANTLR4? And if so, how?

"I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error."

The lexer does not "listen" to the parser: even if the parser is trying to match a NormalChar, the characters 12 will always be tokenized as a QuantExact. The lexer tries to match as many characters as possible, and in case of a tie, it chooses the rule defined first.
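The two rules above (longest match first, then definition order) can be illustrated with a small Python simulation. This is only a sketch and not ANTLR itself: the `tokenize` helper and the rule tables are hypothetical, the rule patterns are transcribed from the question's grammar, and the literal tokens ('{', ',' and so on) are assumed to take priority over the named lexer rules.

```python
import re

# Lexer rules in definition order. Assumption: implicit literal tokens from
# the parser rules come before the named lexer rules.
LITERALS = [
    ("'{'", r"\{"), ("'}'", r"\}"), ("','", r","),
    ("'('", r"\("), ("')'", r"\)"), ("'|'", r"\|"),
]
QUANT_EXACT = ("QuantExact", r"[0-9]+")
NORMAL_CHAR = ("NormalChar", r"[^.\\?*+{}()|\[\]]")
QUANTIFIERS = ("Quantifiers", r"[?*+]")

ORIGINAL_RULES = LITERALS + [QUANT_EXACT, NORMAL_CHAR, QUANTIFIERS]
# Attempt [1] from the question: NormalChar defined before QuantExact.
SWAPPED_RULES = LITERALS + [NORMAL_CHAR, QUANT_EXACT, QUANTIFIERS]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules win ties.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# The lexer never asks what the parser expects: "12" is one QuantExact.
print(tokenize("12{3,4}", ORIGINAL_RULES)[0])             # ('QuantExact', '12')
# With the rules swapped, the single digit 6 ties and becomes a NormalChar.
print(tokenize("a(bd){6,7}c{14,15}", SWAPPED_RULES)[6])   # ('NormalChar', '6')
```

In the swapped order a two-digit run such as 14 is still a QuantExact, because the longer match beats the earlier rule; only single digits change type, which is consistent with the error reported at column 6.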

You could introduce a normalChar rule that matches both a NormalChar and a QuantExact, and use that rule in your atom:

atom              :     normalChar | '(' regExp ')' ;
normalChar        :     NormalChar | QuantExact ;

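This parses, but a digit run such as 12 still arrives as a single QuantExact token, so in 12{3,4} the quantifier applies to 12 as a whole rather than to the 2 alone. Another option would be to let the lexer create single-character tokens only, and let the parser glue these together (much like a PEG). Something like this: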

regExp            :     branch ( '|' branch )* ;
branch            :     piece* ;
piece             :     atom quantifier? ;
quantifier        :     Quantifiers | '{'quantity'}' ;
quantity          :     quantRange | quantMin | quantExact ;
quantRange        :     quantExact ',' quantExact ;
quantMin          :     quantExact ',' ;
atom              :     normalChar | '(' regExp ')' ; 
normalChar        :     NormalChar | Digit ;
quantExact        :     Digit+ ;

Digit             :     [0-9] ;
NormalChar        :     ~[.\\?*+{}()|\[\]] ;
Quantifiers       :     [?*+] ;
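
To see what "letting the parser glue single-character tokens together" amounts to, here is a hand-written recursive-descent sketch in Python that mirrors the rules of the grammar above. It is not generated by ANTLR; the `lex`, `Parser` and `parse` names and the tuple-based tree shape are hypothetical choices made for illustration only.

```python
def lex(text):
    """Emit one token per character, like the single-character lexer."""
    tokens = []
    for ch in text:
        if ch in "0123456789":
            kind = "Digit"
        elif ch in "?*+":
            kind = "Quantifiers"
        elif ch in "{}(),|":
            kind = ch  # literal tokens used by the parser rules
        elif ch in ".\\[]":
            raise ValueError(f"character {ch!r} is outside the simplified grammar")
        else:
            kind = "NormalChar"
        tokens.append((kind, ch))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = lex(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        if self.peek() != kind:
            raise ValueError(f"expected {kind} at token {self.pos}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def reg_exp(self):  # regExp : branch ( '|' branch )*
        branches = [self.branch()]
        while self.peek() == "|":
            self.take("|")
            branches.append(self.branch())
        return branches

    def branch(self):  # branch : piece*
        pieces = []
        while self.peek() in ("Digit", "NormalChar", "("):
            pieces.append(self.piece())
        return pieces

    def piece(self):  # piece : atom quantifier?
        atom = self.atom()
        quantifier = None
        if self.peek() == "Quantifiers":
            quantifier = self.take("Quantifiers")
        elif self.peek() == "{":
            self.take("{")
            quantifier = self.quantity()
            self.take("}")
        return (atom, quantifier)

    def quantity(self):  # quantRange | quantMin | quantExact, as (min, max)
        low = self.quant_exact()
        if self.peek() != ",":
            return (low, low)
        self.take(",")
        high = self.quant_exact() if self.peek() == "Digit" else None
        return (low, high)

    def quant_exact(self):  # quantExact : Digit+  (digits glued by the parser)
        digits = self.take("Digit")
        while self.peek() == "Digit":
            digits += self.take("Digit")
        return int(digits)

    def atom(self):  # atom : normalChar | '(' regExp ')'
        if self.peek() == "(":
            self.take("(")
            inner = self.reg_exp()
            self.take(")")
            return inner
        if self.peek() in ("Digit", "NormalChar"):
            return self.take(self.peek())
        raise ValueError(f"unexpected token at {self.pos}")


def parse(text):
    parser = Parser(text)
    tree = parser.reg_exp()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token at {parser.pos}")
    return tree


print(parse("12{3,4}"))  # [[('1', None), ('2', (3, 4))]]
```

Because every digit is its own token, the same character can serve as an atom outside the braces and as part of a number inside them; the parser decides from context, which the lexer alone cannot do.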
