How do I treat <script> tags differently in a simple ANTLR lexer?

Question

I'm writing a really simple lexer for syntax highlighting of arbitrary text formats, one of which is HTML. The goal of the lexer is just to provide a flat stream of tokens.

I started with the XML tutorial on the ANTLR 3 website, but I am having some trouble with script tags.

An example of HTML that causes this problem:

<head> <script>alert(2 < 3);</script> </head>

And the grammar:

@members {
    boolean inTag = false;
}

TAG_START_OPEN : '<'
                 { inTag = true; } ;
TAG_END_OPEN : '</'
               { inTag = true; } ;

TAG_CLOSE : { inTag }?=> '>' { inTag = false; } ;
TAG_SELF_CLOSE : { inTag }?=> '/>' { inTag = false; } ;
PCDATA : { !inTag }?=> (~'<')+ ;

// ...

The problem is that the lexer gets confused when it sees the '<' within the JavaScript code and thinks it is a close tag. I guess the goal would be for the lexer to use lookahead to determine whether a '<' is followed by '/script>' when the open tag was a script tag, but I'm unsure how to do this nicely with ANTLR.

Thanks in advance for any help.

Answer 1

Here's a quick demo of how you could accomplish this:

grammar T;

options {
  output=AST;
}

tokens {
  DATA;
  ATTRIBUTES;
  ATTRIBUTE;
  ATOMS;
}

@lexer::members {
  private boolean inTag = false;
  private boolean isScript = false;

  private boolean ahead(String s) {
    for(int i = 0; i < s.length(); i++) {
      int ch = input.LA(i + 1);
      if(ch != s.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}

parse
 : tag EOF -> tag
 ;

tag
 : TagOpen attributes TagOpenEnd atoms TagClose -> ^(TagOpen attributes atoms)
 ;

attributes
 : attribute* -> ^(ATTRIBUTES attribute*)
 ;

attribute
 : Key Assign Value -> ^(ATTRIBUTE Key Value)
 ;

atoms
 : atom* -> ^(ATOMS atom*)
 ;

atom
 : PCData
 | ScriptData
 | tag
 ;

TagOpen
 : '<' Name 
   {
     inTag=true; 
     isScript = $Name.text.equals("script");
     setText($Name.text);
   }
 ;

TagClose
 : {!inTag}?=> '</' Name '>' 
   {
     isScript = false;
     setText($Name.text);
   }
 ;

TagOpenEnd
 : {inTag}?=> '>' {inTag=false;}
 ;

Key
 : {inTag}?=> Name
 ;

Assign
 : {inTag}?=> '='
 ;

Value
 : {inTag}?=> '"' ~'"'* '"'
   {
     setText($text.substring(1, $text.length() - 1));
   }
 ;

PCData
 : {!inTag && !isScript}?=> ~'<'+
   {
     if($text.trim().isEmpty()) {
       skip();
     }
   }
 ;

ScriptData
 : {!inTag && isScript}?=> ({!ahead("</script>")}?=> . )+
 ;

Space
 : {inTag}?=> (' ' | '\t' | '\r' | '\n')+ {skip();}
 ;

fragment Name : ('a'..'z' | 'A'..'Z')+;
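
For readers who want to see the lookahead idea outside ANTLR, here is a minimal plain-Java sketch of what the ahead helper and the ScriptData rule do together. The class name ScriptDataSketch and the method scanScriptData are made up for this illustration and are not part of the grammar or of the ANTLR runtime; the sketch works on a String instead of the lexer's input stream.

```java
public class ScriptDataSketch {

    // Hypothetical stand-in for the grammar's ahead(String): true if `s`
    // occurs in `input` starting exactly at index `pos`.
    static boolean ahead(String input, int pos, String s) {
        return input.startsWith(s, pos);
    }

    // Mimics the ScriptData rule: consume characters one at a time for as
    // long as "</script>" is NOT directly ahead. Returns the index where
    // consumption stopped (the start of "</script>", or the end of input).
    static int scanScriptData(String input, int start) {
        int pos = start;
        while (pos < input.length() && !ahead(input, pos, "</script>")) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String html = "<script> alert(2 < 3); </script>";
        int start = html.indexOf('>') + 1;      // first character after "<script>"
        int end = scanScriptData(html, start);  // stops right before "</script>"
        System.out.println("[" + html.substring(start, end) + "]");
        // prints "[ alert(2 < 3); ]"
    }
}
```

The '<' in "2 < 3" is consumed as ordinary script data because it is not the start of "</script>". As in the grammar, the comparison is case-sensitive, so "</SCRIPT>" would not end the script data in this sketch.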

If I now parse the input:

<head>
  <script> alert(2 < 3); </script> 
  <span key="some value" x="<>">
    Mu <em>foo</em> bar!
  </span>
</head>

the following AST will be created by the generated parser:

[Image: the resulting AST]

1 answer. Answer 1: accepted, 2013-01-17 20:20:48