
ANTLR rule to consume fixed number of characters

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is:

s:6:"length";

In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format, if only backreferences were allowed in the repetition count (but they are not).
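Outside a single regex, the same idea takes two steps: match the length prefix, then take exactly that many characters. Below is a minimal Python sketch of that; the helper name read_serialized_string is hypothetical, and it assumes ASCII input so that PHP's byte count equals the character count.

```python
import re

# Matches only the prefix  s:<digits>:"  and captures the declared length.
_PREFIX = re.compile(r's:(\d+):"')


def read_serialized_string(text, pos=0):
    """Return (value, next_pos) for a serialized string starting at pos.

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    m = _PREFIX.match(text, pos)
    if m is None:
        raise ValueError('expected s:<length>:" at position %d' % pos)
    length = int(m.group(1))
    start = m.end()
    end = start + length
    # The payload is taken by count, not by searching for a closing quote.
    if text[end:end + 2] != '";':
        raise ValueError('missing "; after %d characters' % length)
    return text[start:end], end + 2


print(read_serialized_string('s:6:"length";'))  # ('length', 13)
```

The returned next_pos lets a caller continue reading the next value from where this one ended.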

But I cannot find a way to express this in either a lexer or a parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.

This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how to apply it. Note that my target language is Python, while most of the docs and examples are for Java:

// numeral literal
ICON {int counter=0;} :
    /* other alternatives */
    // hollerith
    'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
      {
      $setType(HOLLERITH);
      String str = $getText;
      str = str.replaceFirst("([0-9])+h", "");
      $setText(str);
      }
    /* more alternatives */
    ;

Since input like s:3:"a"b"; is valid, you can't define a delimiter-based String token in your lexer unless the first and last double quotes are always the start and end of your string. But I guess that is not the case.
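To see concretely why delimiter-based matching breaks down, here is a small sketch using Python's re module instead of ANTLR (purely as an illustration; the pattern names naive and greedy are made up for this example). One pattern forbids quotes in the payload, the other allows anything and relies on the closing delimiter.

```python
import re

# Payload is "anything but a double quote": rejects valid input with quotes.
naive = re.compile(r's:(\d+):"([^"]*)";')

# Payload is "anything": runs past the end of the first string.
greedy = re.compile(r's:(\d+):"(.*)";')

print(naive.fullmatch('s:6:"length";') is not None)   # True
print(naive.fullmatch('s:3:"a"b";') is not None)      # False
print(greedy.fullmatch('s:1:"a";s:1:"b";').group(2))  # a";s:1:"b
```

Neither pattern consults the declared length, which is the only thing that says where the payload really ends.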

So, you'll need a lexer rule like this:

SString
  :  's:' Int ':"' ( . )* '";'
  ;

In other words: match s: , then an integer value followed by :" , then zero or more characters that can be anything, ending with "; . But you need to tell the lexer to stop consuming once the number of characters given by Int has been read. You can do that by mixing some plain code into your grammar; plain code is embedded by wrapping it inside { and } . So first convert the text the Int token holds into an integer variable called chars:

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
  ;

Now embed some code inside the ( . )* loop to make it stop consuming as soon as chars has been counted down to zero:

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
  ;

And that's it.
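For readers who want to trace the count-down without generating a lexer, the following plain-Python sketch mirrors the rule step by step. It is not what ANTLR generates; scan_sstring and tokenize are hypothetical names, and error handling is simplified to raising ValueError.

```python
def scan_sstring(text, pos=0):
    """Mimic the SString rule; return the index just past one token."""
    if not text.startswith('s:', pos):
        raise ValueError('expected s: at position %d' % pos)
    i = pos + 2
    digits_start = i
    # Int : '0'..'9'+
    while i < len(text) and text[i] in '0123456789':
        i += 1
    if i == digits_start:
        raise ValueError('expected a length at position %d' % i)
    chars = int(text[digits_start:i])
    if not text.startswith(':"', i):
        raise ValueError('expected :" at position %d' % i)
    i += 2
    # ( {if chars == 0: break} . {chars = chars-1} )*
    while True:
        if chars == 0:
            break
        if i >= len(text):
            raise ValueError('input ended before the declared length')
        i += 1          # consume any one character
        chars -= 1
    if not text.startswith('";', i):
        raise ValueError('expected "; at position %d' % i)
    return i + 2


def tokenize(text):
    """Split the input into consecutive SString tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        end = scan_sstring(text, pos)
        tokens.append(text[pos:end])
        pos = end
    return tokens


print(tokenize('s:6:"length";s:1:""";s:0:"";s:3:"end";'))
```

The check at the top of the loop runs before each character is consumed, which is why a declared length of 0 consumes nothing.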

A little demo grammar:

grammar Test;

options {
  language=Python;
}

parse
  :  (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
  ;

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
  ;

Int
  :  '0'..'9'+
  ;

(Note that you need to escape the % inside your grammar!)

And a test script:

import antlr3
from TestLexer import TestLexer
from TestParser import TestParser

input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()

which produces the following output:

parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]
