I am writing an ANTLR Lexer and Parser grammar that will parse text that is quite similar to a Java class. Eventually it will parse text like the following:
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
type dc:Author {
}
I am building up the Lexer and Parser slowly. I have successfully managed to parse the references but have hit a wall when parsing the type.
Before adding support for the type I was able to use string literals for space, colon, and semicolon in the parser, but afterwards I encountered "cannot create implicit token for string literal" errors. I defined a lexer rule for each of those characters and replaced all occurrences of the literal with the rule. However, this broke the parsing of references.
I have included below the lexer and parser that successfully parse references (along with a sample input and the parsed abstract syntax tree) and the evolved versions, which aren't working. I am not getting any compilation errors but plenty of "token recognition error" messages (screenshot included below).
What is the correct way to handle the parsing?
lexer grammar WorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Colon : ':';
fragment SemiColon: ';';
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
REFERENCE_KEYWORD: 'reference' ;
TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: ' ' -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: ':' -> pushMode(IriMode);
END_IRI: ';' -> popMode;
mode IriMode;
IRI: String -> popMode;
parser grammar WorkingParserGrammar ;
options { tokenVocab=WorkingLexerGrammar; }
document: reference* EOF ;
prefixedReference: REFERENCE_PREFIX ':' IRI;
reference: REFERENCE_KEYWORD ' ' prefixedReference ';';
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Colon : ':';
fragment SemiColon: ';';
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
COLON: Colon;
SEMICOLON: SemiColon;
SPACE: ' ';
REFERENCE_KEYWORD: 'reference' ;
TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX COLON IRI;
reference: REFERENCE_KEYWORD SPACE prefixedReference SEMICOLON;
prefixedName: NAME_PREFIX SPACE LOCAL_NAME;
type: TYPE_KEYWORD SPACE prefixedName;
Following Bart Kiers' help I have made two updates to the lexer and parser grammars with varying success.
This first change parses the type definition correctly, but only if I remove the lexer rules for reference. I think the reason is that the two rules are the same (i.e. PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode); for reference and PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode); for type); that is, they both match on a space. My second update attempts to fix this. The full lexer and parser grammars are below.
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
fragment COLON: ':';
fragment SEMICOLON: ';';
fragment SPACE: ' ';
fragment REFERENCE_KEYWORD: 'reference' ;
fragment TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX REFERENCE_PREFIX_SEPARATOR IRI;
reference: REFERENCE_KEYWORD PREFIXED_REFERENCE prefixedReference END_IRI;
prefixedName: NAME_PREFIX NAME_PREFIX_SEPARATOR LOCAL_NAME;
type: TYPE_KEYWORD PREFIXED_NAME prefixedName END_NAME;
In an attempt to fix this I moved the reference and type keywords into the lexer rules for the corresponding parts, but this only parses the type if I remove all of the lexer rules for reference. References, however, are parsed correctly.
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
fragment COLON: ':';
fragment SEMICOLON: ';';
fragment SPACE: ' ';
fragment REFERENCE_KEYWORD: 'reference' ;
fragment TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: REFERENCE_KEYWORD SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
TYPE_DEFINITION: TYPE_KEYWORD SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX REFERENCE_PREFIX_SEPARATOR IRI;
reference: PREFIXED_REFERENCE prefixedReference END_IRI;
prefixedName: NAME_PREFIX NAME_PREFIX_SEPARATOR LOCAL_NAME;
type: TYPE_DEFINITION prefixedName END_NAME;
For the following input:
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
type dc:Author;
This is the output:
line 4:0 token recognition error at: 't'
line 4:1 token recognition error at: 'y'
line 4:2 token recognition error at: 'p'
line 4:3 token recognition error at: 'e'
line 4:4 token recognition error at: ' '
line 4:5 token recognition error at: 'd'
line 4:6 token recognition error at: 'c'
line 4:7 token recognition error at: ':'
line 4:8 token recognition error at: 'A'
line 4:9 token recognition error at: 'u'
line 4:10 token recognition error at: 't'
line 4:11 token recognition error at: 'h'
line 4:12 token recognition error at: 'o'
line 4:13 token recognition error at: 'r;'
My reasoning for using modes is to limit the scope of rules. This is a language I control, but I would prefer not to change it dramatically. There is much more to the language than I've shown here, and we already have a grammar (currently a combined grammar), but it is quite brittle. I tried to make a change to prevent uppercase characters in prefixes but permit them in the local name, but this snowballed and other rules started applying. Research suggested that modes were an approach to handle this situation, but I'm not very familiar with ANTLR, so I've possibly misunderstood them.
When encountering errors/warnings like these:
line 4:0 token recognition error at: 't'
line 4:1 token recognition error at: 'y'
line 4:2 token recognition error at: 'p'
line 4:3 token recognition error at: 'e'
...
it means that the lexer cannot construct a token for the input (type... in this case). In your case, it means the lexer cannot create a token from the input in the mode it is in at that moment.
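To make that concrete: in an ANTLR lexer grammar, every rule written after a `mode X;` line belongs to mode X until the next `mode` line, and the lexer starts in the default mode. In the last lexer grammar of the question, TYPE_DEFINITION is written after `mode IriMode;`, so it is only available while the lexer is inside IriMode. The Python sketch below is a hand-written imitation of that behaviour, not ANTLR itself: the MODES_AS_WRITTEN table, the lex function and the regexes are my own assumptions that mirror the rules of that grammar, and the error handling is simplified to one error per character.

```python
import copy
import re

PREFIX = r"[_0-9a-z]+"  # stands in for the Prefix fragment

# Rules grouped by the mode they end up in (assumed mirror of the last lexer
# grammar in the question). Entry: (token name, regex, action).
# Actions: None, "skip", "pop", or ("push", mode_name).
MODES_AS_WRITTEN = {
    "DEFAULT": [
        ("WS", r"[\t\n\r]+", "skip"),
        ("PREFIXED_REFERENCE", r"reference ", ("push", "PrefixedReferenceMode")),
    ],
    "PrefixedReferenceMode": [
        ("REFERENCE_PREFIX", PREFIX, None),
        ("REFERENCE_PREFIX_SEPARATOR", r":", ("push", "IriMode")),
        ("END_IRI", r";", "pop"),
    ],
    "IriMode": [
        ("IRI", r'"[^"]*"', "pop"),
        # Written after `mode IriMode;`, so it lives here, not in DEFAULT.
        ("TYPE_DEFINITION", r"type ", ("push", "PrefixedNameMode")),
    ],
    "PrefixedNameMode": [
        ("NAME_PREFIX", PREFIX, None),
        ("NAME_PREFIX_SEPARATOR", r":", ("push", "LocalNameMode")),
        ("END_NAME", r";", "pop"),
    ],
    "LocalNameMode": [
        ("LOCAL_NAME", r"[_0-9a-zA-Z]+", "pop"),
    ],
}

# Same rules, but TYPE_DEFINITION moved into the default mode.
MODES_FIXED = copy.deepcopy(MODES_AS_WRITTEN)
MODES_FIXED["DEFAULT"].append(MODES_FIXED["IriMode"].pop())


def lex(text, modes):
    """Return (tokens, errors). Only rules of the current mode are tried;
    the longest match wins and ties go to the rule listed first."""
    stack, pos, tokens, errors = ["DEFAULT"], 0, [], []
    while pos < len(text):
        best = None
        for name, pattern, action in modes[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            errors.append(text[pos])  # simplified "token recognition error"
            pos += 1
            continue
        name, end, action = best
        if action != "skip":
            tokens.append((name, text[pos:end]))
        if action == "pop":
            stack.pop()
        elif isinstance(action, tuple):
            stack.append(action[1])
        pos = end
    return tokens, errors


SAMPLE = 'reference dc:"https://www.dublincore.org/";\ntype dc:Author;\n'
print(lex(SAMPLE, MODES_AS_WRITTEN)[1])  # characters the default mode rejects
print(lex(SAMPLE, MODES_FIXED)[1])       # errors after moving the rule
```

With the table as written, the lexer is back in the default mode once the reference has been consumed, and nothing there can start with a t. In grammar terms, the corresponding change would be to move TYPE_DEFINITION above the first `mode` line so that it belongs to the default mode; I have not run that grammar through ANTLR here, so treat it as a suggestion to try.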
"I tried to make a change to prevent uppercase characters in prefixes but permit them in the local name but this snowballed and other rules started applying"
There are two options to resolve such things:
document
 : reference* type* EOF
 ;
reference
 : K_REFERENCE LOWER_ID COL STRING SCOL
 ;
type
 : K_TYPE LOWER_ID COL id OPAR CPAR
 ;
id
 : LOWER_ID
 | ID
 ;
K_REFERENCE : 'reference';
K_TYPE : 'type';
LOWER_ID : [a-z_] [a-z_0-9]*;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
STRING : '"' ~["]* '"';
SCOL : ';';
COL : ':';
OPAR : '{';
CPAR : '}';
SPACES : [ \t\r\n] -> skip;
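The grammar above relies on how an ANTLR lexer picks a rule: it takes the rule that matches the most characters, and when several rules match the same number of characters, the one defined first wins. That is why the keywords are listed before LOWER_ID, and LOWER_ID before ID. The Python sketch below imitates just that selection rule for the token definitions above; the RULES table and the tokenize function are my own illustration, not something ANTLR generates.

```python
import re

# Lexer rules in the order they appear in the grammar above.
RULES = [
    ("K_REFERENCE", r"reference"),
    ("K_TYPE", r"type"),
    ("LOWER_ID", r"[a-z_][a-z_0-9]*"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("STRING", r'"[^"]*"'),
    ("SCOL", r";"),
    ("COL", r":"),
    ("OPAR", r"\{"),
    ("CPAR", r"\}"),
    ("SPACES", r"[ \t\r\n]"),
]
SKIPPED = {"SPACES"}


def tokenize(text):
    """Longest match wins; on a tie the rule listed first is kept."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name not in SKIPPED:
            out.append((best_name, text[pos:best_end]))
        pos = best_end
    return out


for sample in ["type dc:Author { }", "dc:author", "references"]:
    print(sample, "->", tokenize(sample))
```

A lowercase local name such as author comes out as LOWER_ID rather than ID, which is why the parser rule id accepts both tokens. By the same logic, a name spelled exactly like a keyword would come out as the keyword token, so id would have to list the keyword tokens too if such names must be allowed.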
Modes are meant to be used for input that really consists of two (or more) languages embedded in each other. For example, parsing HTML files: there is content (text) and there are tags with attributes. From what I see, you're not using them as they are meant to be used, IMO.
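As a rough illustration of that kind of input, here is a small Python imitation of a two-mode lexer for HTML-like text. The mode names CONTENT and TAG, the token names and the regexes are all hypothetical and chosen only for this sketch. The same characters are tokenized differently depending on whether the lexer is outside or inside a tag.

```python
import re

# Hypothetical two-mode lexer: free text outside tags, a small tag language
# inside them. Actions: None, "skip", "pop", or "push:<mode>".
MODES = {
    "CONTENT": [
        ("TEXT", r"[^<]+", None),
        ("TAG_OPEN", r"</?", "push:TAG"),
    ],
    "TAG": [
        ("NAME", r"[A-Za-z][A-Za-z0-9]*", None),
        ("EQUALS", r"=", None),
        ("VALUE", r'"[^"]*"', None),
        ("WS", r"[ \t\r\n]+", "skip"),
        ("TAG_CLOSE", r">", "pop"),
    ],
}


def lex(text):
    """Tokenize text, trying only the rules of the mode on top of the stack."""
    stack, pos, tokens = ["CONTENT"], 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        name, end, action = best
        if action != "skip":
            tokens.append((name, text[pos:end]))
        if action == "pop":
            stack.pop()
        elif action and action.startswith("push:"):
            stack.append(action[len("push:"):])
        pos = end
    return tokens


# "href=..." is plain TEXT outside a tag but NAME EQUALS VALUE inside one.
print(lex('href="x" <a href="x">link</a>'))
```

Here the two modes describe genuinely different languages. In the question, prefixes, names and IRIs all belong to one small language, which a single set of rules can usually handle.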