简体   繁体   中英

Running Antlr4 parser with lexer grammar gets token recognition errors

I'm trying to create a grammar to parse Solr queries (only mildly relevant and you don't need to know anything about solr to answer the question -- just know more than I do about antlr 4.7). I'm basing it on the QueryParser.jj file from solr 6. I looked for an existing one, but there doesn't seem to be one that isn't old and out-of-date.

I'm stuck because when I try to run the parser I get "token recognition error"s.

The lexer I created uses lexer modes which, as I understand it means I need to have a separate lexer grammar file. So, I have a parser and a lexer file.

I whittled it down to a simple example to show I'm seeing. Maybe someone can tell me what I'm doing wrong. Here's the parser (Junk.g4):

grammar Junk;

options {
  language = Java;
  tokenVocab=JLexer;
}

term : TERM '\r\n'; 

I can't use an import because of the lexer modes in the lexer file I'm trying to create (the tokens in the modes become "undefined" if I use an import). That's why I reference the lexer file with the tokenVocab parameter (as shown in the XML example in github).

Here's the lexer (JLexer.g4):

lexer grammar JLexer;

TERM : TERM_START_CHAR TERM_CHAR* ;

TERM_START_CHAR : [abc] ;  
TERM_CHAR : [efg] ; 
WS  : [ \t\n\r\u3000]+ -> skip;

If I copy the lexer code into the parser, then things work as expected (eg, "aeee" is a term). Also, if I run the lexer file with grun (specifying tokens as the target), then the string parses as a TERM (as expected).

If I run the parser ("grun Junk term -tokens"), then I get:

line 1:0 token recognition error at: 'a'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'e'
line 1:3 token recognition error at: 'e'
[@0,4:5='\r\n',<'
'>,1:4]

I "compile" the lexer first, then "compile" the parser and then javac the resulting java files. I do this in a batch file, so I'm pretty confident that I'm doing this every time.

I don't understand what I'm doing wrong. Is it the way I'm running grun? Any suggestions would be appreciated.

Always trust your intuition! There is some convention internal to grun :-) See here TestRig.java c. lines 125, 150. Would have been lot nicer if some additional CLI args were also added.

When lexer and grammar are compiled separately, the grammar name - in your case - would be (insofar as TestRig goes) "Junk" and the two files must be named "JunkLexer.g4" and "JunkParser.g4". Accordingly the headers in parser file JunkParser.g4 should be modified too

parser grammar JunkParser;
options { tokenVocab=JunkLexer; }
... stuff

Now you can run your tests

> antlr4 JunkLexer
> antlr4 JunkParser
> javac Junk*.java
> grun Junk term -tokens
aeee
^Z
[@0,0:3='aeee',<TERM>,1:0]
[@1,6:5='<EOF>',<EOF>,2:0]
>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM