
What's the best way to handle optional tokens in ANTLR4?

Suppose I have the following input:

Great University
Graduated in 2010
Some University
09/2009 - 06/2011
Nice University
06/2011

I want to handle the years of study. My grammar looks like this:

education:
    (section)*
    EOF
    ;

section:
    (school | years)+
    ;

degree:     WORD* DEGREE WORD* SEPARATOR;
years:      WORD* ( (YEAR_START '-')? YEAR_END) WORD* SEPARATOR;
WS          : [ \t\r]+ -> skip;
SEPARATOR   : (NEWLINE | COMMA);
COMMA       : ',';
NEWLINE     : '\n';
SCHOOL      : ('university' | 'University' | 'school' | 'School');
WORD        : [a-zA-Z'()]+;
YEAR_START  : YEAR;
YEAR_END    : YEAR;
YEAR        : (DIGIT DIGIT '/')? [1-2] DIGIT DIGIT DIGIT;
DIGIT       : [0-9];

I'm getting the following errors:

line 1:17 mismatched input '\n' expecting '-'
line 6:17 mismatched input '\n' expecting '-'

How can I handle an optional start year in the grammar?

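The lexer assigns exactly one token type to each piece of matched input. Your grammar expects it to match the same year pattern under three token types (YEAR_START, YEAR_END and YEAR) and to decide at runtime which one is correct. This is not how ANTLR works: the lexer tokenizes the input without any knowledge of the parser rules.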

In your case all years (not only the optional one) will be captured by the first of these rules, i.e. YEAR_START, because when several lexer rules match the same text, the rule defined first wins. This gives the following tokenization:

"Graduated in 2010" -> WORD WORD YEAR_START

The only parser rule that can match this token sequence is

 years:      WORD* ( (YEAR_START '-')? YEAR_END) WORD* SEPARATOR;

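but it requires a '-' after YEAR_START, and the input has a newline there instead. That is exactly what the error message reports.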
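
The tie-breaking behaviour behind this can be seen in isolation. When several lexer rules match the same longest stretch of input, ANTLR picks the one that appears first in the grammar. The sketch below repeats the three rules as posted, with comments added for illustration:

```antlr
YEAR_START  : YEAR;   // listed first, so it wins every tie for year-shaped input
YEAR_END    : YEAR;   // same pattern, listed later: the lexer never emits this type
YEAR        : (DIGIT DIGIT '/')? [1-2] DIGIT DIGIT DIGIT;  // shadowed by YEAR_START too
```

So the parser only ever sees YEAR_START tokens, and any alternative that needs a YEAR_END can never succeed.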

The grammar should work if you delete the YEAR_START and YEAR_END rules and replace all occurrences of them with YEAR. YEAR_START and YEAR_END were presumably meant to distinguish the start year from the end year, but labels exist for exactly that purpose.
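
A minimal sketch of that change, assuming the rest of the grammar stays as posted. The label names startYear and endYear are illustrative choices, not part of the original grammar:

```antlr
// Parser rule: one YEAR token type, with labels telling the two positions apart.
years:      WORD* ( (startYear=YEAR '-')? endYear=YEAR ) WORD* SEPARATOR;

// Lexer: YEAR_START and YEAR_END are deleted, only YEAR remains.
YEAR        : (DIGIT DIGIT '/')? [1-2] DIGIT DIGIT DIGIT;
DIGIT       : [0-9];
```

In the generated context class for years, the labels become token fields (ctx.startYear and ctx.endYear in the Java target), and startYear is null when the optional part was not present. The names start and stop are best avoided as labels, since the parser rule context already has fields with those names.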

If this does not work, please post your complete grammar; the one you posted does not, for example, contain a rule for DEGREE.


 