How do I treat <script> tags differently in a simple ANTLR lexer?
I'm writing a really simple lexer for syntax highlighting of arbitrary text formats, one of which is HTML. The goal of the lexer is just to provide a flat stream of tokens.
I started with the XML tutorial on the ANTLR 3 website, but I am having some trouble with script tags.
An example of HTML that causes this problem:
<head> <script>alert(2 < 3);</script> </head>
And the grammar:
@members {
boolean inTag = false;
}
TAG_START_OPEN : '<'
{ inTag = true; } ;
TAG_END_OPEN : '</'
{ inTag = true; } ;
TAG_CLOSE : { inTag }?=> '>' { inTag = false; } ;
TAG_SELF_CLOSE : { inTag }?=> '/>' { inTag = false; } ;
PCDATA : { !inTag }?=> (~'<')+ ;
// ...
The problem is that the lexer gets confused when it sees the '<' character inside the JavaScript code and treats it as the start of a tag. I guess the goal would be for the lexer to use lookahead to determine whether a '<' is followed by '/script>' when the open tag was a script tag, but I'm unsure how to do this nicely with ANTLR.
Thanks in advance for any help.
Here's a quick demo of how you could accomplish this:
grammar T;
options {
output=AST;
}
tokens {
DATA;
ATTRIBUTES;
ATTRIBUTE;
ATOMS;
}
@lexer::members {
private boolean inTag = false;
private boolean isScript = false;
private boolean ahead(String s) {
for(int i = 0; i < s.length(); i++) {
int ch = input.LA(i + 1);
if(ch != s.charAt(i)) {
return false;
}
}
return true;
}
}
parse
: tag EOF -> tag
;
tag
: TagOpen attributes TagOpenEnd atoms TagClose -> ^(TagOpen attributes atoms)
;
attributes
: attribute* -> ^(ATTRIBUTES attribute*)
;
attribute
: Key Assign Value -> ^(ATTRIBUTE Key Value)
;
atoms
: atom* -> ^(ATOMS atom*)
;
atom
: PCData
| ScriptData
| tag
;
TagOpen
: '<' Name
{
inTag=true;
isScript = $Name.text.equals("script");
setText($Name.text);
}
;
TagClose
: {!inTag}?=> '</' Name '>'
{
isScript = false;
setText($Name.text);
}
;
TagOpenEnd
: {inTag}?=> '>' {inTag=false;}
;
Key
: {inTag}?=> Name
;
Assign
: {inTag}?=> '='
;
Value
: {inTag}?=> '"' ~'"'* '"'
{
setText($text.substring(1, $text.length() - 1));
}
;
PCData
: {!inTag && !isScript}?=> ~'<'+
{
if($text.trim().isEmpty()) {
skip();
}
}
;
ScriptData
: {!inTag && isScript}?=> ({!ahead("</script>")}?=> . )+
;
Space
: {inTag}?=> (' ' | '\t' | '\r' | '\n')+ {skip();}
;
fragment Name : ('a'..'z' | 'A'..'Z')+;
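As an aside, the lookahead idea behind ahead(...) and the isScript flag can be sketched in plain Java, outside ANTLR. This is only an illustration under simplifying assumptions, not part of the grammar: the class name ScriptAwareScanner, the method scan and the token labels are hypothetical, attributes are not tokenized separately, and a '>' inside an attribute value is not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not generated by ANTLR: a hand-written scanner that
// mirrors the isScript flag and the ahead(...) check from the grammar.
class ScriptAwareScanner {

    // True if `s` occurs in `input` starting exactly at index `pos`.
    static boolean ahead(String input, int pos, String s) {
        return input.startsWith(s, pos);
    }

    // Produces a flat list of labelled tokens:
    // TAG_OPEN, TAG_CLOSE, PCDATA and SCRIPT.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        boolean isScript = false;
        int pos = 0;
        while (pos < input.length()) {
            int start = pos;
            if (isScript && !ahead(input, pos, "</script>")) {
                // Inside a script element: a '<' is ordinary text unless
                // it begins the literal "</script>".
                while (pos < input.length() && !ahead(input, pos, "</script>")) {
                    pos++;
                }
                tokens.add("SCRIPT:" + input.substring(start, pos));
            } else if (input.charAt(pos) == '<') {
                // Simplification: a tag runs up to the next '>'.
                int end = input.indexOf('>', pos);
                if (end < 0) {
                    end = input.length() - 1;
                }
                String tag = input.substring(pos, end + 1);
                pos = end + 1;
                if (tag.startsWith("</")) {
                    isScript = false;
                    tokens.add("TAG_CLOSE:" + tag);
                } else {
                    isScript = tag.equals("<script>") || tag.startsWith("<script ");
                    tokens.add("TAG_OPEN:" + tag);
                }
            } else {
                // Ordinary character data: everything up to the next '<'.
                while (pos < input.length() && input.charAt(pos) != '<') {
                    pos++;
                }
                tokens.add("PCDATA:" + input.substring(start, pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : scan("<head><script>alert(2 < 3);</script></head>")) {
            System.out.println(token);
        }
    }
}
```

For the input used in main, the '<' in "2 < 3" stays inside a single SCRIPT token instead of opening a tag. The ScriptData rule in the grammar achieves the same effect with a predicate evaluated before each character.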
If I now parse this input with the grammar above:
<head>
<script> alert(2 < 3); </script>
<span key="some value" x="<>">
Mu <em>foo</em> bar!
</span>
</head>
the following AST will be created by the generated parser: