
What is wrong with whitespace in ANTLR?

I have a really simple XML (HTML) parsing grammar in ANTLR:

wiki: ggg+;

ggg: tag | text;

tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';

text: tx=TEXT { System.out.println($tx.getText()); };

CHAR: ~('<'|'>');
TEXT: CHAR+;

With input such as "<ggg> fff" it works fine.

But when I start to deal with whitespace, it fails. For example:

  • " <ggg> fff " - fails at the beginning
  • "<ggg> <hhh> " - fails after <ggg>
  • "<ggg> fff " - works fine
  • "<ggg> " - fails at the end

I don't know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me a NoViableAltException.

ANTLR's lexer rules match as much input as possible. Only when two (or more) rules match the same number of characters does the rule defined first "win". Because of that, a single character other than '<' and '>' is tokenized as a CHAR token, and not as a TEXT token, regardless of what the parser "needs" (remember that the lexer operates independently from the parser!). Only a run of two or more characters other than '<' and '>' is tokenized as a (single) TEXT token.

Therefore, the input " <ggg> fff " creates the following 5 tokens:

type    | text
--------+-----------
CHAR    |   ' '
'<'     |   '<'
TEXT    |   'ggg'
'>'     |   '>'
TEXT    |   ' fff '

And since the token CHAR is not accounted for in your parser rule(s), the parse fails.

Simply remove CHAR and do:

TEXT : ~('<'|'>')+;
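
The tokenization described above can be reproduced with a small simulation. This is only a simplified model of the "longest match, first rule wins on a tie" behaviour, not ANTLR's actual lexer implementation; the class `LongestMatchDemo` and its methods are hypothetical names made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class LongestMatchDemo {

    // Number of consecutive characters other than '<' and '>' starting at pos.
    static int runLength(String in, int pos) {
        int n = 0;
        while (pos + n < in.length()
                && in.charAt(pos + n) != '<'
                && in.charAt(pos + n) != '>') {
            n++;
        }
        return n;
    }

    // How many characters the named rule matches at pos (0 means no match).
    static int matchLength(String rule, String in, int pos) {
        char c = in.charAt(pos);
        switch (rule) {
            case "'<'":  return c == '<' ? 1 : 0;
            case "'>'":  return c == '>' ? 1 : 0;
            case "CHAR": return Math.min(1, runLength(in, pos)); // ~('<'|'>')
            case "TEXT": return runLength(in, pos);              // one or more such characters
            default: throw new IllegalArgumentException(rule);
        }
    }

    // At each position the longest match wins; on a tie the rule listed first wins.
    static List<String> tokenize(String in, String... rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            String best = null;
            int bestLen = 0;
            for (String rule : rules) {
                int len = matchLength(rule, in, pos);
                // Strict '>' keeps the earlier rule when lengths are equal.
                if (len > bestLen) {
                    best = rule;
                    bestLen = len;
                }
            }
            if (best == null) {
                throw new IllegalStateException("no rule matches at " + pos);
            }
            tokens.add(best + "(" + in.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = " <ggg> fff ";
        // Original rules: the lone leading space ties between CHAR and TEXT, CHAR is first.
        System.out.println(tokenize(input, "'<'", "'>'", "CHAR", "TEXT"));
        // prints [CHAR( ), '<'(<), TEXT(ggg), '>'(>), TEXT( fff )]

        // With CHAR removed, the same space becomes a TEXT token the parser can handle.
        System.out.println(tokenize(input, "'<'", "'>'", "TEXT"));
        // prints [TEXT( ), '<'(<), TEXT(ggg), '>'(>), TEXT( fff )]
    }
}
```

The first call reproduces the token table above; the second shows why dropping CHAR is enough for this input, since every run of non-bracket characters, however short, now comes out as TEXT.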

You have no token to deal with the space. To a lexer, a space is no different from any other character it may encounter.

If whitespace is unimportant, you can simply use:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+    { $channel = HIDDEN; } ;
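
For readers unfamiliar with `$channel = HIDDEN`: the lexer still emits the whitespace token, but the token stream hands the parser only the tokens on the default channel. A minimal Java sketch of that filtering follows. The class `HiddenChannelDemo`, the channel numbers, and the hand-written token list are assumptions for illustration, not ANTLR API; the list assumes the WHITESPACE rule is defined ahead of the other lexer rules, so that a lone space becomes a WHITESPACE token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HiddenChannelDemo {
    // Illustrative channel numbers for this sketch.
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99;

    static final class Tok {
        final String type;
        final String text;
        final int channel;

        Tok(String type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }
    }

    // Hand-written stand-in for the lexer output on "<ggg> <hhh> ".
    static List<Tok> sampleTokens() {
        return Arrays.asList(
            new Tok("'<'", "<", DEFAULT_CHANNEL),
            new Tok("TEXT", "ggg", DEFAULT_CHANNEL),
            new Tok("'>'", ">", DEFAULT_CHANNEL),
            new Tok("WHITESPACE", " ", HIDDEN_CHANNEL),
            new Tok("'<'", "<", DEFAULT_CHANNEL),
            new Tok("TEXT", "hhh", DEFAULT_CHANNEL),
            new Tok("'>'", ">", DEFAULT_CHANNEL),
            new Tok("WHITESPACE", " ", HIDDEN_CHANNEL));
    }

    // Keep only the token types the parser would see: default-channel tokens.
    static List<String> visibleToParser(List<Tok> all) {
        List<String> visible = new ArrayList<>();
        for (Tok t : all) {
            if (t.channel == DEFAULT_CHANNEL) {
                visible.add(t.type);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        System.out.println(visibleToParser(sampleTokens()));
        // prints ['<', TEXT, '>', '<', TEXT, '>']
    }
}
```

What remains is exactly two `tag` sequences, which is why the parser no longer trips over the spaces between and after the tags.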

If whitespace is important to you:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ ;
CHAR: ~('<'|'>');
TEXT: (CHAR|WHITESPACE)+;
