What is wrong with whitespaces in ANTLR?
I have a really simple XML (HTML) parsing ANTLR grammar:
wiki: ggg+;
ggg: tag | text;
tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';
text: tx=TEXT { System.out.println($tx.getText()); };
CHAR: ~('<'|'>');
TEXT: CHAR+;
With such input: "<ggg> fff"
it works fine.
But when I start to deal with whitespace it fails. For example:
" <ggg> fff " - fails at the beginning
"<ggg> <hhh> " - fails after <ggg>
"<ggg> fff " - works fine
"<ggg> " - fails at the end
I don't know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me a NoViableAltException.
ANTLR's lexer rules match as much as possible. Only when two (or more) rules match the same number of characters does the rule defined first "win". Because of that, a single character other than '<' and '>' is tokenized as a CHAR token, and not as a TEXT token, regardless of what the parser "needs" (the lexer operates independently of the parser, remember that!). Only a run of two or more characters other than '<' and '>' is tokenized as a (single) TEXT token.
So the input " <ggg> fff " creates the following 5 tokens:
type | text
--------+-----------
CHAR | ' '
'<' | '<'
TEXT | 'ggg'
'>' | '>'
TEXT | ' fff '
And since the token CHAR is not accounted for in your parser rule(s), the parse fails.
Simply remove CHAR and do:
TEXT : ~('<'|'>')+;
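To make the tie-breaking rule concrete, here is a small stand-alone Java sketch. It is not ANTLR's generated lexer, and the class and method names (LexerSketch, tokenize, runLength) are made up for illustration. It only mimics the behaviour described above, "longest match wins, first-defined rule wins a tie", for the two grammars in question, so you can see why the stray CHAR token shows up and why it goes away once CHAR is removed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT ANTLR's generated code: it only mimics the
// "longest match wins, first rule wins ties" behaviour for this grammar.
class LexerSketch {

    // Length of the run of characters other than '<' and '>' starting at pos.
    static int runLength(String input, int pos) {
        int end = pos;
        while (end < input.length()
                && input.charAt(end) != '<' && input.charAt(end) != '>') {
            end++;
        }
        return end - pos;
    }

    // withCharRule = true  models the original grammar (CHAR defined before TEXT).
    // withCharRule = false models the fixed grammar with only TEXT : ~('<'|'>')+ ;
    static List<String> tokenize(String input, boolean withCharRule) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '<' || c == '>') {
                tokens.add("'" + c + "'");      // literal tokens from the parser rules
                pos++;
                continue;
            }
            int textLen = runLength(input, pos); // TEXT can match the whole run
            int charLen = 1;                     // CHAR matches exactly one character
            // Equal lengths are a tie, and the rule defined first (CHAR) wins it.
            String type = (withCharRule && charLen == textLen) ? "CHAR" : "TEXT";
            tokens.add(type + " '" + input.substring(pos, pos + textLen) + "'");
            pos += textLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [CHAR ' ', '<', TEXT 'ggg', '>', TEXT ' fff ']
        System.out.println(tokenize(" <ggg> fff ", true));
        // prints [TEXT ' ', '<', TEXT 'ggg', '>', TEXT ' fff ']
        System.out.println(tokenize(" <ggg> fff ", false));
    }
}
```

With the CHAR rule present, the leading space comes out as CHAR, which none of the parser rules expect. With only TEXT, every run of non-bracket characters is a TEXT token, which the text rule accepts.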
You have no token to deal with the space. To a lexer, a space is no different from any other character it may encounter.
If whitespace is unimportant you can simply use:
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
If whitespace is important to you:
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ ;
CHAR: ~('<'|'>');
TEXT: (CHAR|WHITESPACE)+;