Why can my ANTLR4 grammar not parse this text?

Question

I want to be able to parse the following text using ANTLR4:

six-buffers() {
    evil-window-split();
    evil-window-vsplit();
    evil-window-vsplit();
    evil-window-down(1);
    evil-window-vsplit();
    evil-window-vsplit();
};
six-buffers();

First I define a function, then I call it.

To do so, I defined the following grammar:

grammar Deplorable;

script: statement*;
statement: (methodCall | functionDeclaration) ';' (WHITESPACE|NEW_LINE);

// General stuff
deplorableString: '"' DEPLORABLE_STRING* '"';
deplorableInteger: DEPLORABLE_NUMBER;

// Method call definition
methodCall: methodName LPAREN (methodArgument COMMA?)* RPAREN;

methodName: DEPLORABLE_IDENTIFIER;
methodArgument: (deplorableString | deplorableInteger);

// Function Declaration
functionStatement: methodCall ';' (WHITESPACE|NEW_LINE);
functionDeclaration: methodName LPAREN RPAREN functionBody;
functionBody: CURLY_BRACE_LEFT functionStatement* CURLY_BRACE_RIGHT;

// Lexer stuff
LPAREN: '(';
RPAREN: ')';
DEPLORABLE_IDENTIFIER: (LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | DASH)+;
DEPLORABLE_STRING: (LOWERCASE_LATIN_LETTER | UPPERCASE_LATIN_LETTER | UNDERSCORE | WHITESPACE | EXCLAMATION_POINT)+;

CURLY_BRACE_LEFT: '{';
CURLY_BRACE_RIGHT: '}';

NEW_LINE: ('\r\n'|'\n'|'\r');

DEPLORABLE_NUMBER: DIGIT+;

fragment COMMA: ',';

fragment DASH: '-';
fragment LOWERCASE_LATIN_LETTER: 'a'..'z';
fragment UPPERCASE_LATIN_LETTER: 'A'..'Z';
fragment UNDERSCORE: '_';
fragment WHITESPACE: ' ';
fragment EXCLAMATION_POINT: '!';
fragment DIGIT: '0'..'9';

I compile this grammar using mvn clean antlr4:antlr4 install (with disabled tests). Here is my pom.xml file.

However, when I try to parse the above text in a test, I get the error:

line 1:13 no viable alternative at input 'six-buffers() '

I tried to add void in front of a function declaration so that the parser can distinguish between function declarations and function calls, but this did not help.

How can I fix this error, i.e., make sure that the parser correctly recognizes a function declaration and does not mistake it for a function call?

Update 1: This version of the grammar (inspired by Mike Cargal) seems to work for now:

grammar Deplorable;

script: statement*;
statement: (methodCall | functionDeclaration) ';';

// General stuff

// Method call definition
methodCall: methodName LPAREN (methodArgument COMMA?)* RPAREN;

methodName: DEPLORABLE_IDENTIFIER;
methodArgument: (DEPLORABLE_STRING | DEPLORABLE_NUMBER);

// Function Declaration
functionStatement: methodCall ';';
functionDeclaration: methodName LPAREN RPAREN functionBody;
functionBody: CURLY_BRACE_LEFT functionStatement* CURLY_BRACE_RIGHT;

// Lexer stuff
LPAREN: '(';
RPAREN: ')';
DEPLORABLE_IDENTIFIER: (
        LOWERCASE_LATIN_LETTER
        | UPPERCASE_LATIN_LETTER
        | UNDERSCORE
        | DASH
    )+;
DEPLORABLE_STRING: '"' (
        LOWERCASE_LATIN_LETTER
        | UPPERCASE_LATIN_LETTER
        | UNDERSCORE
        | WHITESPACE
        | EXCLAMATION_POINT
    )+ '"';

CURLY_BRACE_LEFT: '{';
CURLY_BRACE_RIGHT: '}';

NEW_LINE: (
    '\r' '\n'?
    | '\n'
) -> skip;

DEPLORABLE_NUMBER: DIGIT+;

fragment COMMA: ',';

fragment DASH: '-';
fragment LOWERCASE_LATIN_LETTER: 'a'..'z';
fragment UPPERCASE_LATIN_LETTER: 'A'..'Z';
fragment UNDERSCORE: '_';
WHITESPACE: [ \t]+ -> skip;
fragment EXCLAMATION_POINT: '!';
fragment DIGIT: '0'..'9';

Answer 1

@sepp2k is pointing you in the right direction.

Your Lexer rules (particularly DEPLORABLE_STRING) are causing your pain. More specifically, this looks like the misconception a lot of people have, early in using ANTLR, that a Parser rule can have anything to do with tokenization.

In the ANTLR pipeline, your input is first tokenized into a stream of tokens using the Lexer rules. So dumping out your stream of tokens is frequently very helpful.

In your case, the token stream looks like this:

[@0,0:10='six-buffers',<DEPLORABLE_IDENTIFIER>,1:0]
[@1,11:11='(',<'('>,1:11]
[@2,12:12=')',<')'>,1:12]
[@3,13:13=' ',<DEPLORABLE_STRING>,1:13]
[@4,14:14='{',<'{'>,1:14]
[@5,15:15='\n',<NEW_LINE>,1:15]
[@6,16:23='    evil',<DEPLORABLE_STRING>,2:0]
[@7,24:36='-window-split',<DEPLORABLE_IDENTIFIER>,2:8]
[@8,37:37='(',<'('>,2:21]
[@9,38:38=')',<')'>,2:22]
[@10,39:39=';',<';'>,2:23]
[@11,40:40='\n',<NEW_LINE>,2:24]
[@12,41:48='    evil',<DEPLORABLE_STRING>,3:0]
[@13,49:62='-window-vsplit',<DEPLORABLE_IDENTIFIER>,3:8]
[@14,63:63='(',<'('>,3:22]
[@15,64:64=')',<')'>,3:23]
[@16,65:65=';',<';'>,3:24]
[@17,66:66='\n',<NEW_LINE>,3:25]
[@18,67:74='    evil',<DEPLORABLE_STRING>,4:0]
[@19,75:88='-window-vsplit',<DEPLORABLE_IDENTIFIER>,4:8]
[@20,89:89='(',<'('>,4:22]
[@21,90:90=')',<')'>,4:23]
[@22,91:91=';',<';'>,4:24]
[@23,92:92='\n',<NEW_LINE>,4:25]
[@24,93:100='    evil',<DEPLORABLE_STRING>,5:0]
[@25,101:112='-window-down',<DEPLORABLE_IDENTIFIER>,5:8]
[@26,113:113='(',<'('>,5:20]
[@27,114:114='1',<DEPLORABLE_NUMBER>,5:21]
[@28,115:115=')',<')'>,5:22]
[@29,116:116=';',<';'>,5:23]
[@30,117:117='\n',<NEW_LINE>,5:24]
[@31,118:125='    evil',<DEPLORABLE_STRING>,6:0]
[@32,126:139='-window-vsplit',<DEPLORABLE_IDENTIFIER>,6:8]
[@33,140:140='(',<'('>,6:22]
[@34,141:141=')',<')'>,6:23]
[@35,142:142=';',<';'>,6:24]
[@36,143:143='\n',<NEW_LINE>,6:25]
[@37,144:151='    evil',<DEPLORABLE_STRING>,7:0]
[@38,152:165='-window-vsplit',<DEPLORABLE_IDENTIFIER>,7:8]
[@39,166:166='(',<'('>,7:22]
[@40,167:167=')',<')'>,7:23]
[@41,168:168=';',<';'>,7:24]
[@42,169:169='\n',<NEW_LINE>,7:25]
[@43,170:170='}',<'}'>,8:0]
[@44,171:171=';',<';'>,8:1]
[@45,172:172='\n',<NEW_LINE>,8:2]
[@46,173:183='six-buffers',<DEPLORABLE_IDENTIFIER>,9:0]
[@47,184:184='(',<'('>,9:11]
[@48,185:185=')',<')'>,9:12]
[@49,186:186=';',<';'>,9:13]
[@50,187:186='<EOF>',<EOF>,9:14]

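You'll notice that at token @3 (character 13), a single ' ' is being tokenized as a DEPLORABLE_STRING. The same thing happens at the start of every indented line: '    evil' becomes a DEPLORABLE_STRING, and only the rest ('-window-split') becomes a DEPLORABLE_IDENTIFIER.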
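To see why the lexer behaves this way, here is a minimal sketch in Python of the two tie-breaking rules ANTLR lexers follow: the longest match wins, and on a tie the rule listed first wins. It is a toy simulation, not ANTLR itself; the names ORIGINAL_RULES and tokenize are made up for illustration, and the regexes are hand transcriptions of the lexer rules in the question (fragments inlined as character classes).

```python
import re

# Toy stand-in for ANTLR's lexer (not ANTLR itself). Rules are listed in
# grammar order; the literals used in parser rules (';' and '"') come first.
# Fragments from the question's grammar are inlined as character classes.
ORIGINAL_RULES = [
    ("';'", r";"),
    ("'\"'", r'"'),
    ("'('", r"\("),
    ("')'", r"\)"),
    ("DEPLORABLE_IDENTIFIER", r"[a-zA-Z_-]+"),
    ("DEPLORABLE_STRING", r"[a-zA-Z_ !]+"),
    ("'{'", r"\{"),
    ("'}'", r"\}"),
    ("NEW_LINE", r"\r\n|\n|\r"),
    ("DEPLORABLE_NUMBER", r"[0-9]+"),
]


def tokenize(text, rules):
    """Split text into (rule_name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' means an earlier rule keeps the win on a tie.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


sample = "six-buffers() {\n    evil-window-split();"
for name, lexeme in tokenize(sample, ORIGINAL_RULES):
    print(f"{name:<22} {lexeme!r}")
```

At the leading spaces of the second line, DEPLORABLE_IDENTIFIER cannot match at all (a space is not in its character set), while DEPLORABLE_STRING happily consumes '    evil' and stops at the first '-'. The parser never gets a chance to object, because it only ever sees the tokens the lexer has already produced.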

You'll need to incorporate the quotation marks into your DEPLORABLE_STRING rule.

I'd also suggest you skip WHITESPACE, and probably NEW_LINE as well (most grammars treat newlines as whitespace).

Something like this should get you "unstuck":

grammar Deplorable;

script: statement*;
statement: (methodCall | functionDeclaration) ';' (
        WHITESPACE
        | NEW_LINE
    );

// General stuff deplorableString: '"' DEPLORABLE_STRING* '"'; deplorableInteger: DEPLORABLE_NUMBER;

// Method call definition
methodCall: methodName LPAREN (methodArgument COMMA?)* RPAREN;

methodName: DEPLORABLE_IDENTIFIER;
methodArgument: (DEPLORABLE_STRING | DEPLORABLE_NUMBER);

// Function Declaration
functionStatement: methodCall ';' (WHITESPACE | NEW_LINE);
functionDeclaration: methodName LPAREN RPAREN functionBody;
functionBody:
    CURLY_BRACE_LEFT functionStatement* CURLY_BRACE_RIGHT;

// Lexer stuff
LPAREN: '(';
RPAREN: ')';
DEPLORABLE_IDENTIFIER: (
        LOWERCASE_LATIN_LETTER
        | UPPERCASE_LATIN_LETTER
        | UNDERSCORE
        | DASH
    )+;
DEPLORABLE_STRING:
    '"' (
        LOWERCASE_LATIN_LETTER
        | UPPERCASE_LATIN_LETTER
        | UNDERSCORE
        | WHITESPACE
        | EXCLAMATION_POINT
    )+ '"';

CURLY_BRACE_LEFT: '{';
CURLY_BRACE_RIGHT: '}';

NEW_LINE: ('\r\n' | '\n' | '\r');

DEPLORABLE_NUMBER: DIGIT+;

fragment COMMA: ',';

fragment DASH: '-';
fragment LOWERCASE_LATIN_LETTER: 'a' ..'z';
fragment UPPERCASE_LATIN_LETTER: 'A' ..'Z';
fragment UNDERSCORE: '_';
fragment WHITESPACE: ' ' -> skip;
fragment EXCLAMATION_POINT: '!';
fragment DIGIT: '0' ..'9';

That's still tripping on an extraneous \n (hence my comment about whitespace and newline handling). I'm not sure what your intention is, but take a look at how other grammars handle it. It's usually MUCH easier to skip them than to account for every place in the parser rules where they might occur.
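As a rough illustration of what skipping buys you, the following Python sketch simulates a lexer whose rules mirror the "Update 1" grammar in the question: DEPLORABLE_STRING requires its quotation marks, and WHITESPACE and NEW_LINE are matched but dropped. This is a toy model, not ANTLR; SKIPPING_RULES and tokenize_skipping are hypothetical names, and the regexes are hand transcriptions of the lexer rules.

```python
import re

# Toy stand-in for ANTLR's lexer (not ANTLR itself). Each entry is
# (rule name, regex, skip?), in grammar order, transcribed by hand from the
# "Update 1" grammar. skip=True plays the role of ANTLR's "-> skip" command.
SKIPPING_RULES = [
    ("';'", r";", False),
    ("'('", r"\(", False),
    ("')'", r"\)", False),
    ("DEPLORABLE_IDENTIFIER", r"[a-zA-Z_-]+", False),
    ("DEPLORABLE_STRING", r'"[a-zA-Z_ \t!]+"', False),
    ("'{'", r"\{", False),
    ("'}'", r"\}", False),
    ("NEW_LINE", r"\r\n?|\n", True),
    ("DEPLORABLE_NUMBER", r"[0-9]+", False),
    ("WHITESPACE", r"[ \t]+", True),
]


def tokenize_skipping(text, rules):
    """Longest-match tokenizer that drops tokens from rules marked skip."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (name, length, skip) of the best match so far
        for name, pattern, skip in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' means an earlier rule keeps the win on a tie.
            if match and (best is None or match.end() - pos > best[1]):
                best = (name, match.end() - pos, skip)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos]!r}")
        name, length, skip = best
        if not skip:
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens


sample = "six-buffers() {\n    evil-window-down(1);\n};"
for name, lexeme in tokenize_skipping(sample, SKIPPING_RULES):
    print(f"{name:<22} {lexeme!r}")
```

With whitespace and newlines dropped in the lexer, 'evil-window-down' arrives as a single DEPLORABLE_IDENTIFIER, and the parser rules no longer need to mention WHITESPACE or NEW_LINE anywhere.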

Most importantly: get your mental model of the ANTLR pipeline right. ANTLR first turns your stream of characters into a stream of tokens (using the Lexer rules), and only then applies the parser rules to that stream of tokens. You'll be in for a lot of pain until that's clear to you.

solution1
1 ACCEPTED 2021-02-16 11:55:43
