
Matching arbitrary text (both symbols and spaces) with ANTLR?

How can I match arbitrary text in ANTLR v4, that is, text that is unknown at the time the grammar is written?

My grammar is as follows:

grammar Anytext;

line :
    comment;

comment : '#' anytext;

anytext: ANY*;

WS : [ \t\r\n]+;

ANY : .;

My code is as follows:

    // imports needed by this snippet (ANTLR 4 runtime)
    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.tree.ParseTree;

    String line = "# This_is_a_comment";

    ANTLRInputStream input = new ANTLRInputStream(line);
    AnytextLexer lexer = new AnytextLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    AnytextParser parser = new AnytextParser(tokens);

    ParseTree tree = parser.comment(); // start parsing at the comment rule
    System.out.println(tree.toStringTree(parser)); // print LISP-style tree

The output is as follows:

line 1:1 extraneous input ' ' expecting {<EOF>, ANY}
(comment # (anytext   T h i s _ i s _ a _ c o m m e n t))

If I change the ANY rule to

ANY : [ \t\r\n.];

it stops recognizing any symbol at all.

UPDATE1

My input has no end-of-line character at the end.

UPDATE 2

So, as I understand it, it is impossible to match arbitrary text with the lexer, because the lexer cannot assign a character to more than one token class. If I define a lexer rule that matches any symbol, it either hides all the other rules or does not work.

But the question persists.

How can I match all symbols at the parser level then?

Suppose I have table-shaped data and I want to process some fields and ignore others. If I had an anytext rule, I would write

infoline :
    ( codepoint WS 'field1' WS field1Value ) |
    ( codepoint WS 'field2' WS field2Value ) |
    ( codepoint WS anytext );

Here I parse a row if its second column contains field1 or field2, and ignore the row otherwise.

How can I accomplish this?

It's important to remember that ANTLR breaks your complete input into tokens before the parser ever sees the first token (or at least it behaves as if it does). For your grammar, the lexer rules are effectively the following.

T__0 : '#'; // implicit token created due to the use of '#' in parser rule comment

WS : [ \t\r\n]+;

ANY : .;

For your input, the tokens are the following:

  1. # (type T__0 )
  2. [space] (type WS )
  3. T (type ANY )
  4. h (type ANY )
  5. i (type ANY )
  6. s (type ANY )
  7. _ (type ANY )
  8. i (type ANY )
  9. s (type ANY )
  10. _ (type ANY )
  11. a (type ANY )
  12. _ (type ANY )
  13. c (type ANY )
  14. o (type ANY )
  15. m (type ANY )
  16. m (type ANY )
  17. e (type ANY )
  18. n (type ANY )
  19. t (type ANY )
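
To make the token list above easier to reproduce, here is a small sketch in plain Java that imitates the three lexer rules by hand. It is not ANTLR-generated code and does not use the ANTLR runtime; the class name TokenSketch and the method tokenize are made up for this illustration, and the sketch only assumes the usual lexer behaviour of taking the longest match and preferring the rule listed first on a tie.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {
    // Hypothetical helper: same character set as WS : [ \t\r\n]+;
    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Hand-written imitation of T__0, WS and ANY; returns the token type names in order.
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '#') {
                // T__0 : '#';  listed before ANY, so it wins for this single character
                types.add("T__0");
                i++;
            } else if (isWs(c)) {
                // WS : [ \t\r\n]+;  one token for the whole run of whitespace
                while (i < input.length() && isWs(input.charAt(i))) {
                    i++;
                }
                types.add("WS");
            } else {
                // ANY : .;  exactly one character per token
                types.add("ANY");
                i++;
            }
        }
        return types;
    }

    public static void main(String[] args) {
        List<String> types = tokenize("# This_is_a_comment");
        System.out.println(types.size() + " tokens: " + types);
    }
}
```

For the input "# This_is_a_comment" this yields 19 token types: one T__0, one WS, and 17 ANY tokens, matching the list above. The point of the sketch is that the space never becomes an ANY token, because WS is listed before ANY and matches it first.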

Your current grammar fails to parse because the WS token isn't allowed in the comment rule. It would parse this input (but may run into problems as you expand your grammar) if you used this:

// remember that '#' is its own token
anytext: (ANY | WS | '#')*;

What you could do is change comment to be a lexer rule, which consumes the # character along with whatever follows (in this case, to the end of the line):

grammar Anytext;

line : COMMENT;

COMMENT : '#' ~[\r\n]*;

WS : [ \t\r\n]+;

ANY : .;

Use the following rule for line comments (ANTLR 3 syntax):

LINE_COMMENT
    :   '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

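It matches '#' and every following character up to and including the line break (Unix or Windows style). Note that the rule requires a terminating '\n', so a comment on the last line of input that has no line break is not matched by it.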

Edit by 280Z28: here is the exact same rule in ANTLR 4 syntax:

LINE_COMMENT
    :   '#' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN)
    ;
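
As a rough way to compare this rule with the COMMENT rule from the other answer, the sketch below translates both into java.util.regex patterns and tries them on the question's input. The regexes are hand-made stand-ins for the lexer rules, not something ANTLR produces, and the class name CommentRuleSketch and the helper matchesWhole are invented for the example.

```java
import java.util.regex.Pattern;

public class CommentRuleSketch {
    // Regex stand-in for  COMMENT : '#' ~[\r\n]*;
    static final Pattern COMMENT = Pattern.compile("#[^\\r\\n]*");

    // Regex stand-in for  LINE_COMMENT : '#' ~[\r\n]* '\r'? '\n'
    static final Pattern LINE_COMMENT = Pattern.compile("#[^\\r\\n]*\\r?\\n");

    // True only if the pattern consumes the entire string.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String noNewline = "# This_is_a_comment";   // the input from the question
        String unix = noNewline + "\n";
        String windows = noNewline + "\r\n";

        System.out.println("COMMENT, no newline:      " + matchesWhole(COMMENT, noNewline));
        System.out.println("LINE_COMMENT, no newline: " + matchesWhole(LINE_COMMENT, noNewline));
        System.out.println("LINE_COMMENT, unix:       " + matchesWhole(LINE_COMMENT, unix));
        System.out.println("LINE_COMMENT, windows:    " + matchesWhole(LINE_COMMENT, windows));
    }
}
```

Under these assumptions, the COMMENT-style pattern accepts the input without a trailing line break, while the LINE_COMMENT-style pattern accepts it only once a '\n' or '\r\n' follows. If the last line of the input may lack a line break, dropping the '\r'? '\n' part of the rule (and letting a separate whitespace rule handle line breaks) is one possible adjustment.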
