What is wrong with whitespace in ANTLR?

Question

I have a really simple XML (HTML) parsing ANTLR grammar:

wiki: ggg+;

ggg: tag | text;

tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';

text: tx=TEXT { System.out.println($tx.getText()); };

CHAR: ~('<'|'>');
TEXT: CHAR+;

With input such as "<ggg> fff" it works fine.

But when I start to deal with whitespace, it fails. For example:

" <ggg> fff " - fails at the beginning
"<ggg> <hhh> " - fails after <ggg>
"<ggg> fff " - works fine
"<ggg> " - fails at the end

I don't know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me a NoViableAltException.

Answer 1

ANTLR's lexer rules match as much as possible. Only when two (or more) rules match the same number of characters does the rule defined first "win". Because of that, a single character other than '<' and '>' is tokenized as a CHAR token, not as a TEXT token, regardless of what the parser "needs" (remember that the lexer operates independently of the parser). Only a run of two or more characters other than '<' and '>' is tokenized as a (single) TEXT token.

Therefore, the input " <ggg> fff " creates the following 5 tokens:

type    | text
--------+-----------
CHAR    |   ' '
'<'     |   '<'
TEXT    |   'ggg'
'>'     |   '>'
TEXT    |   ' fff '

And since the CHAR token is not accounted for in your parser rule(s), the parse fails.
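The matching rule described above (longest match wins, ties go to the rule defined first) can be imitated in a few lines of plain Java. This is only a sketch for illustration: the class MaximalMunchDemo and its helpers are hypothetical names, it assumes the literal tokens '<' and '>' are ordered before CHAR and TEXT, and it is not how ANTLR's generated lexer is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: imitates "longest match, first rule wins on a tie".
// It is NOT ANTLR's real lexer, only a model of the behaviour described above.
class MaximalMunchDemo {

    // Rules in definition order. '<' and '>' come from the parser rule `tag`;
    // CHAR and TEXT are the lexer rules from the question.
    static final String[] RULE_NAMES = {"'<'", "'>'", "CHAR", "TEXT"};

    // Number of characters the given rule can match at `pos` (0 means no match).
    static int matchLength(int rule, String input, int pos) {
        char c = input.charAt(pos);
        switch (rule) {
            case 0: return c == '<' ? 1 : 0;
            case 1: return c == '>' ? 1 : 0;
            case 2: return (c == '<' || c == '>') ? 0 : 1;   // CHAR: ~('<'|'>')
            default:                                          // TEXT: CHAR+
                int end = pos;
                while (end < input.length()
                        && input.charAt(end) != '<' && input.charAt(end) != '>') {
                    end++;
                }
                return end - pos;
        }
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestRule = -1;
            int bestLen = 0;
            for (int rule = 0; rule < RULE_NAMES.length; rule++) {
                int len = matchLength(rule, input, pos);
                // Strictly greater: on a tie, the earlier rule keeps the win.
                if (len > bestLen) {
                    bestLen = len;
                    bestRule = rule;
                }
            }
            // Every character is covered by some rule, so bestRule is always set.
            tokens.add(RULE_NAMES[bestRule] + "[" + input.substring(pos, pos + bestLen) + "]");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(" <ggg> fff "));
        System.out.println(tokenize("<ggg> <hhh> "));
    }
}
```

For " <ggg> fff " the leading space ties between CHAR and TEXT at one character each, so it comes out as CHAR, while "ggg" and " fff " are longer runs and come out as TEXT. The same tie explains the lone spaces in "<ggg> <hhh> ".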

Simply remove CHAR and do:

TEXT : ~('<'|'>')+;
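To see what the single TEXT rule changes, here is a rough stand-in written with java.util.regex. It is a sketch under assumptions, not ANTLR output: the class SingleTextRuleDemo is a hypothetical name, and the regular expression merely mirrors the three token kinds ('<', '>', and TEXT as one or more characters other than '<' and '>').

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: approximates the token stream once CHAR is removed
// and TEXT is defined as ~('<'|'>')+.
class SingleTextRuleDemo {

    // '<' | '>' | TEXT. The alternatives start with different characters,
    // so at any position at most one of them can match.
    static final Pattern TOKEN = Pattern.compile("<|>|[^<>]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String text = m.group();
            boolean isAngle = text.equals("<") || text.equals(">");
            String type = isAngle ? "'" + text + "'" : "TEXT";
            tokens.add(type + "[" + text + "]");
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(" <ggg> fff "));
        System.out.println(tokenize("<ggg> "));
    }
}
```

With only one rule for non-angle characters, a lone space has nothing to tie with and becomes a TEXT token, which the parser rule text accepts.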

Answer 2

You have no token to deal with the space. To a lexer, a space is no different from any other character it may encounter.

If whitespace is unimportant, you can simply use:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+    { $channel = HIDDEN; } ;

If whitespace is important to you:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ ;
CHAR: ~('<'|'>');
TEXT: (CHAR|WHITESPACE)+;

Answer 1: score 3, accepted, 2012-06-24 14:04:28

Answer 2: score 1, 2012-06-24 13:20:37
