ANTLR grammar for reStructuredText (rule priorities)

First question stream

Hello everyone,

This could be a follow-up to this question: Antlr rule priorities

I'm trying to write an ANTLR grammar for the reStructuredText markup language.

The main problem I'm facing is: "How to match any sequence of characters (regular text) without masking other grammar rules?"

Let's take an example of a paragraph with inline markup:

In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element 
*before* the insert point. The variable ``after_ptr`` points to the element *after* the 
insert. In other words, we are going to put our new element **in between** ``before_ptr`` 
and ``after_ptr``.

I thought that writing rules for inline markup text would be easy. So I wrote a simple grammar:

grammar Rst;

options {
    output=AST;
    language=Java;
    backtrack=true;
    //memoize=true;
}

@members {
boolean inInlineMarkup = false;
}

// PARSER

text
    : inline_markup (WS? inline_markup)* WS? EOF
    ;


inline_markup
@after {
inInlineMarkup = false;
}
    : {!inInlineMarkup}? (emphasis|strong|litteral|link)
    ;

emphasis
@init {
inInlineMarkup = true;
}
    : '*' (~'*')+ '*' {System.out.println("emphasis: " + $text);}
    ;

strong
@init {
inInlineMarkup = true;
}
    : '**' (~'*')+ '**' {System.out.println("bold: " + $text);}
    ;

litteral
@init {
inInlineMarkup = true;
}
    : '``' (~'`')+ '``' {System.out.println("litteral: " + $text);}
    ;

link
@init {
inInlineMarkup = true;
}
    : inline_internal_target
    | footnote_reference
    | hyperlink_reference
    ;

inline_internal_target
    : '_`' (~'`')+ '`' {System.out.println("inline_internal_target: " + $text);}
    ;

footnote_reference
    : '[' (~']')+ ']_' {System.out.println("footnote_reference: " + $text);}
    ;


hyperlink_reference
    : ~(' '|'\t'|'\u000C'|'_')+ '_' {System.out.println("hyperlink_reference: " + $text);}
    |   '`' (~'`')+ '`_' {System.out.println("hyperlink_reference (long): " + $text);}
    ;

// LEXER

WS  
  : (' '|'\t'|'\u000C')+
  ; 

NEWLINE
  : '\r'? '\n'
  ;

This simple grammar doesn't work. And I didn't even try to match regular text...

My questions:

  • Could someone point to my errors and maybe give me a hint on how to match regular text?
  • Is there a way to set priority on the grammar rules? Maybe this could be a lead.

Thanks in advance for your help :-)

Robin


Second question stream

Thank you very much for your help! I would have had a hard time figuring out my errors... I'm not writing this grammar (only) to learn ANTLR; I'm trying to write an IDE plugin for Eclipse. And for that, I need a grammar ;)

I managed to get further with the grammar and wrote a text rule:

grammar Rst;

options {
    output=AST;
    language=Java;
}



@members {
boolean inInlineMarkup = false;
}

//////////////////
// PARSER RULES //
//////////////////

file
  : line* EOF
  ;


line
  : text* NEWLINE
  ;

text
    : inline_markup
    | normal_text
    ;

inline_markup
@after {
inInlineMarkup = false;
}
    : {!inInlineMarkup}? {inInlineMarkup = true;} 
  (
  | STRONG
  | EMPHASIS
  | LITTERAL
  | INTERPRETED_TEXT
  | SUBSTITUTION_REFERENCE
  | link
  )
    ;


link
    : INLINE_INTERNAL_TARGET
    | FOOTNOTE_REFERENCE
    | HYPERLINK_REFERENCE
    ;

normal_text
  : {!inInlineMarkup}? 
   ~(EMPHASIS
      |SUBSTITUTION_REFERENCE
      |STRONG
      |LITTERAL
      |INTERPRETED_TEXT
      |INLINE_INTERNAL_TARGET
      |FOOTNOTE_REFERENCE
      |HYPERLINK_REFERENCE
      |NEWLINE
      )
  ;
//////////////////
// LEXER TOKENS //
//////////////////

EMPHASIS
    : STAR ANY_BUT_STAR+ STAR {System.out.println("EMPHASIS: " + $text);}
    ;

SUBSTITUTION_REFERENCE
  : PIPE ANY_BUT_PIPE+ PIPE  {System.out.println("SUBST_REF: " + $text);}
  ;

STRONG
    : STAR STAR ANY_BUT_STAR+ STAR STAR {System.out.println("STRONG: " + $text);}
    ;

LITTERAL
    : BACKTICK BACKTICK ANY_BUT_BACKTICK+ BACKTICK BACKTICK {System.out.println("LITTERAL: " + $text);}
    ;
INTERPRETED_TEXT
  : BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("LITTERAL: " + $text);}
  ;

INLINE_INTERNAL_TARGET
    : UNDERSCORE BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("INLINE_INTERNAL_TARGET: " + $text);}
    ;

FOOTNOTE_REFERENCE
    : L_BRACKET ANY_BUT_BRACKET+ R_BRACKET UNDERSCORE {System.out.println("FOOTNOTE_REFERENCE: " + $text);}
    ;


HYPERLINK_REFERENCE
  : BACKTICK ANY_BUT_BACKTICK+ BACKTICK UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (long): " + $text);}
  | ANY_BUT_ENDLINK+ UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (short): " + $text);}
  ;

WS  
  : (' '|'\t')+ {$channel=HIDDEN;}
  ; 

NEWLINE
  : '\r'? '\n' {$channel=HIDDEN;}
  ;


///////////////
// FRAGMENTS //
///////////////

fragment ANY_BUT_PIPE
  : ESC PIPE
  | ~(PIPE|'\n'|'\r')
  ;
fragment ANY_BUT_BRACKET
  : ESC R_BRACKET
  | ~(R_BRACKET|'\n'|'\r')
  ;
fragment ANY_BUT_STAR
  : ESC STAR
  | ~(STAR|'\n'|'\r')
  ;
fragment ANY_BUT_BACKTICK
  : ESC BACKTICK
  | ~(BACKTICK|'\n'|'\r')
  ;
fragment ANY_BUT_ENDLINK
  : ~(UNDERSCORE|' '|'\t'|'\n'|'\r')
  ;



fragment ESC
  : '\\'
  ;
fragment STAR
  : '*'
  ;
fragment BACKTICK
  : '`'
  ;
fragment PIPE
  : '|'
  ;
fragment L_BRACKET
  : '['
  ;
fragment R_BRACKET
  : ']'
  ;
fragment UNDERSCORE
  : '_'
  ;

The grammar is working fine for inline_markup but normal_text is not matched.

Here is my test class:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.Tree;

public class Test {

    public static void main(String[] args) throws RecognitionException, IOException {

        InputStream is = Test.class.getResourceAsStream("test.rst");
        Reader r = new InputStreamReader(is);
        StringBuilder source = new StringBuilder();
        char[] buffer = new char[1024];
        int readLenght = 0;
        while ((readLenght = r.read(buffer)) > 0) {
            if (readLenght < buffer.length) {
                source.append(buffer, 0, readLenght);
            } else {
                source.append(buffer);
            }
        }
        r.close();
        System.out.println(source.toString());

        ANTLRStringStream in = new ANTLRStringStream(source.toString());
        RstLexer lexer = new RstLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        RstParser parser = new RstParser(tokens);
        RstParser.file_return out = parser.file();
        System.out.println(((Tree)out.getTree()).toStringTree());
    }
}

And the input file I use:

In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element 
*before* the insert point. The variable ``after_ptr`` points to the |element| *after* the 
insert. In other words, `we are going`_ to put_ our new element **in between** ``before_ptr`` 
and ``after_ptr``.

And I get this output:

HYPERLINK_REFERENCE (short): 7-6`_
line 1:2 mismatched character ' ' expecting '_'
line 1:10 mismatched character ' ' expecting '_'
line 1:18 mismatched character ' ' expecting '_'
line 1:21 mismatched character ' ' expecting '_'
line 1:26 mismatched character ' ' expecting '_'
line 1:37 mismatched character ' ' expecting '_'
LITTERAL: `before_ptr`
line 1:86 no viable alternative at character '\r'
line 1:55 mismatched character ' ' expecting '_'
line 1:60 mismatched character ' ' expecting '_'
line 1:63 mismatched character ' ' expecting '_'
line 1:70 mismatched character ' ' expecting '_'
line 1:73 mismatched character ' ' expecting '_'
line 1:77 mismatched character ' ' expecting '_'
line 1:85 mismatched character ' ' expecting '_'
EMPHASIS: *before*
line 2:12 mismatched character ' ' expecting '_'
line 2:19 mismatched character ' ' expecting '_'
line 2:26 mismatched character ' ' expecting '_'
LITTERAL: `after_ptr`
line 2:30 mismatched character ' ' expecting '_'
line 2:39 mismatched character ' ' expecting '_'
line 2:90 no viable alternative at character '\r'
line 2:60 mismatched character ' ' expecting '_'
line 2:63 mismatched character ' ' expecting '_'
line 2:67 mismatched character ' ' expecting '_'
line 2:77 mismatched character ' ' expecting '_'
line 2:85 mismatched character ' ' expecting '_'
line 2:89 mismatched character ' ' expecting '_'
line 3:7 mismatched character ' ' expecting '_'
line 3:10 mismatched character ' ' expecting '_'
line 3:16 mismatched character ' ' expecting '_'
line 3:23 mismatched character ' ' expecting '_'
line 3:27 mismatched character ' ' expecting '_'
line 3:31 mismatched character ' ' expecting '_'
line 3:42 mismatched character ' ' expecting '_'
line 3:51 mismatched character ' ' expecting '_'
line 3:55 mismatched character ' ' expecting '_'
line 3:63 mismatched character ' ' expecting '_'
line 3:94 mismatched character '\r' expecting '*'
line 4:3 mismatched character ' ' expecting '_'
line 4:18 no viable alternative at character '\r'
line 4:18 mismatched character '\r' expecting '_'
HYPERLINK_REFERENCE (short): oing`_
HYPERLINK_REFERENCE (short): ut_
EMPHASIS: *in between*
LITTERAL: `after_ptr`
BR.recoverFromMismatchedToken
line 0:-1 mismatched input '<EOF>' expecting NEWLINE
null

Can you point to my error(s)? (The parser works for inline markup without errors when I add the filter=true; option to the grammar.)

Robin

Here's a quick demo of how you could parse this reStructuredText. Note that it handles only a small subset of the available markup syntax, and that adding more to it will affect the existing parser/lexer rules: so there is much, much more work to be done!

Demo

grammar RST;

options {
  output=AST;
  backtrack=true;
  memoize=true;
}

tokens {
  ROOT;
  PARAGRAPH;
  INDENTATION;
  LINE;
  WORD;
  BOLD;
  ITALIC;
  INTERPRETED_TEXT;
  INLINE_LITERAL;
  REFERENCE;
}

parse
  :  paragraph+ EOF -> ^(ROOT paragraph+)
  ;

paragraph
  :  line+ -> ^(PARAGRAPH line+)
  |  Space* LineBreak -> /* omit line-breaks between paragraphs from AST */
  ;

line
  :  indentation text+ LineBreak -> ^(LINE text+)
  ;

indentation
  :  Space* -> ^(INDENTATION Space*)
  ;

text
  :  styledText
  |  interpretedText
  |  inlineLiteral
  |  reference
  |  Space
  |  Star
  |  EscapeSequence
  |  Any
  ;

styledText
  :  bold
  |  italic
  ;

bold
  :  Star Star boldAtom+ Star Star -> ^(BOLD boldAtom+)
  ;  

italic
  :  Star italicAtom+ Star -> ^(ITALIC italicAtom+)
  ;

boldAtom
  :  ~(Star | LineBreak)
  |  italic
  ;

italicAtom
  :  ~(Star | LineBreak)
  |  bold
  ;

interpretedText
  :  BackTick interpretedTextAtoms BackTick -> ^(INTERPRETED_TEXT interpretedTextAtoms)
  ;

interpretedTextAtoms
  :  ~BackTick+
  ;

inlineLiteral
  :  BackTick BackTick inlineLiteralAtoms BackTick BackTick -> ^(INLINE_LITERAL inlineLiteralAtoms)
  ;

inlineLiteralAtoms
  :  inlineLiteralAtom+
  ;

inlineLiteralAtom
  :  ~BackTick
  |  BackTick ~BackTick
  ;

reference
  :  Any+ UnderScore -> ^(REFERENCE Any+)
  ;

UnderScore
  :  '_'
  ;

BackTick
  :  '`'
  ;

Star
  :  '*'
  ;

Space
  :  ' ' 
  |  '\t'
  ;

EscapeSequence
  :  '\\' ('\\' | '*')
  ;

LineBreak
  :  '\r'? '\n'
  |  '\r'
  ;

Any
  :  .
  ;

When you generate a parser and lexer from the above and let them parse the following input file:

***x*** **yyy** *zz* *
a b c

P2 ``*a*`b`` `q`
Python_

(note the trailing line break!)

the parser will produce the following AST:

[image: AST diagram]
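As a rough picture of what the demo's lexer rules do before the parser groups anything, here is a hand-written Java sketch that slices input into the same token types (UnderScore, BackTick, Star, Space, EscapeSequence, LineBreak, Any). It is not the code ANTLR generates, and the class name `DemoLexerSketch` and method `tokenize` are made up for this illustration. The point it tries to show is that every character becomes some token, with `Any` as the catch-all for regular text, so nothing is masked at the lexer level.

```java
import java.util.ArrayList;
import java.util.List;

class DemoLexerSketch {

    // Hypothetical helper: returns the token type names for the given input,
    // mimicking the demo grammar's lexer rules one character (or pair) at a time.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean hasNext = i + 1 < input.length();
            if (c == '_') {
                tokens.add("UnderScore");
                i++;
            } else if (c == '`') {
                tokens.add("BackTick");
                i++;
            } else if (c == '*') {
                tokens.add("Star");
                i++;
            } else if (c == ' ' || c == '\t') {
                tokens.add("Space");
                i++;
            } else if (c == '\\' && hasNext
                    && (input.charAt(i + 1) == '\\' || input.charAt(i + 1) == '*')) {
                // EscapeSequence : '\\' ('\\' | '*')
                tokens.add("EscapeSequence");
                i += 2;
            } else if (c == '\r' && hasNext && input.charAt(i + 1) == '\n') {
                // LineBreak : '\r'? '\n' (Windows line ending consumed as one token)
                tokens.add("LineBreak");
                i += 2;
            } else if (c == '\r' || c == '\n') {
                tokens.add("LineBreak");
                i++;
            } else {
                // Any : . (the fallback that picks up regular text)
                tokens.add("Any");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("*zz* `q`\n"));
    }
}
```

Because the tokens are this small, deciding whether a Star opens italic text, opens bold text, or is just a literal asterisk is left entirely to the parser rules.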

EDIT

The graph can be created by running this class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source =
        "***x*** **yyy** *zz* *\n" +
        "a b c\n" +
        "\n" +
        "P2 ``*a*`b`` `q`\n" +
        "Python_\n";
    RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source));
    RSTParser parser = new RSTParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

or if your source comes from a file, do:

RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst"));

or

RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst", "???"));

where "???" is the encoding of your file.

The class above will print the AST as a DOT file to the console. You can use a DOT viewer to display the AST. In this case, I posted an image created by kgraphviewer. But there are many more viewers around. A nice online one is this one, which appears to be using kgraphviewer under "the hood". Good luck!

Robin wrote:

I thought that writing rules for inline markup text would be easy

I must admit that I am not familiar with this markup language, but it seems to resemble BB-Code or Wiki markup, which are not easily translated into an (ANTLR) grammar! These languages don't let themselves be easily tokenized, since what a piece of input means depends on where it occurs. White space sometimes has a special meaning (as with definition lists). So no, it's not at all easy, IMO. So if this is just an exercise for you to get acquainted with ANTLR (or parser generators in general), I highly recommend choosing something else to parse.

Robin wrote:

Could someone point to my errors and maybe give me a hint on how to match regular text?

You must first realize that ANTLR creates a lexer (tokenizer) and a parser. Lexer rules start with an upper case letter and parser rules start with a lower case letter. A parser can only operate on tokens (the objects that are made by lexer rules). To keep things orderly, you should not use token literals inside parser rules (see rule q in the grammar below). Also, the ~ (negation) meta char has a different meaning depending on where it's used (in a parser rule or a lexer rule).

Take the following grammar:

p : T;
q : ~'z';

T : ~'x';
U : 'y';

ANTLR will first "move" the 'z' literal to a lexer rule like this:

p : T;
q : ~RANDOM_NAME;

T : ~'x';
U : 'y';
RANDOM_NAME : 'z';

(the name RANDOM_NAME is not used, but that doesn't matter). Now, the parser rule q does not match any character other than 'z'! A negation inside a parser rule negates a token (or lexer rule). So ~RANDOM_NAME will match either lexer rule T or lexer rule U.

Inside lexer rules, ~ negates (single!) characters. So the lexer rule T will match any character in the range \u0000 .. \uFFFF except 'x'. Note that the following: ~'ab' is invalid inside a lexer rule: you can only negate single-character sets.

So, all these ~'???' inside your parser rules are wrong (wrong as in: they don't behave as you expect them to).
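The two meanings of ~ can be modelled in a few lines of plain Java. This is only an illustration of the sets involved, not ANTLR code: the class `NegationSketch`, the enum `TokenType` and both methods are hypothetical names, and the enum simply lists the three lexer rules of the small grammar above.

```java
import java.util.EnumSet;
import java.util.Set;

class NegationSketch {

    // The token types defined by the small example grammar (T, U, and the
    // rule ANTLR creates for the 'z' literal, called RANDOM_NAME here).
    enum TokenType { T, U, RANDOM_NAME }

    // Lexer-level negation, as in  T : ~'x';
    // It ranges over single characters: anything except 'x'.
    static boolean lexerNotX(char c) {
        return c != 'x';
    }

    // Parser-level negation, as in  q : ~RANDOM_NAME;
    // It ranges over token types: every token type except the excluded one.
    static Set<TokenType> parserNot(TokenType excluded) {
        return EnumSet.complementOf(EnumSet.of(excluded));
    }

    public static void main(String[] args) {
        System.out.println("lexer ~'x' accepts 'a': " + lexerNotX('a'));
        System.out.println("lexer ~'x' accepts 'x': " + lexerNotX('x'));
        System.out.println("parser ~RANDOM_NAME accepts: " + parserNot(TokenType.RANDOM_NAME));
    }
}
```

Seen this way, a parser rule such as '*' (~'*')+ '*' does not mean "a star, then any characters but a star": it means "a star token, then any tokens other than the star token", which only works if the lexer produces tokens for everything in between.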

Robin wrote:

Is there a way to set priority on the grammar rules? Maybe this could be a lead.

Yes, the order is top to bottom in both lexer and parser rules (where the top has the highest priority). Let's say parse is the entry point of your grammar:

parse
  :  p
  |  q
  ;

then p will be tried first, and if that fails, q is tried.

As for lexer rules, rules that match keywords, for example, are placed before a rule that could possibly match those same keywords:

// first keywords:
WHILE : 'while';
IF    : 'if';
ELSE  : 'else';

// and only then, the identifier rule: 
ID    : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
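To see why the order matters, here is a small hand-written Java sketch in which rules are tried top to bottom and the first one that matches the whole word wins. It is a deliberate simplification and not how the generated ANTLR lexer is implemented; the class `RulePrioritySketch`, the helper `rules` and the method `classify` are names invented for this example, and the regular expressions merely stand in for the lexer rules above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class RulePrioritySketch {

    // Stand-in for the ID lexer rule: a letter or underscore, then letters, digits or underscores.
    static final String ID_REGEX = "[a-zA-Z_][a-zA-Z_0-9]*";

    // Keywords declared before ID, as in the grammar above.
    static final Map<String, Pattern> KEYWORDS_FIRST =
            rules("WHILE", "while", "IF", "if", "ELSE", "else", "ID", ID_REGEX);

    // Same rules, but with ID declared first (the ordering to avoid).
    static final Map<String, Pattern> ID_FIRST =
            rules("ID", ID_REGEX, "WHILE", "while", "IF", "if", "ELSE", "else");

    // Builds an ordered rule table from alternating name/regex pairs.
    // LinkedHashMap keeps the declaration order.
    static Map<String, Pattern> rules(String... nameRegexPairs) {
        Map<String, Pattern> table = new LinkedHashMap<>();
        for (int i = 0; i < nameRegexPairs.length; i += 2) {
            table.put(nameRegexPairs[i], Pattern.compile(nameRegexPairs[i + 1]));
        }
        return table;
    }

    // Returns the name of the first rule (in declaration order) matching the whole word.
    static String classify(String word, Map<String, Pattern> table) {
        for (Map.Entry<String, Pattern> rule : table.entrySet()) {
            if (rule.getValue().matcher(word).matches()) {
                return rule.getKey();
            }
        }
        return "NO_MATCH";
    }

    public static void main(String[] args) {
        for (String word : new String[] {"while", "whilst", "else", "x1"}) {
            System.out.println(word + ": keywords first = " + classify(word, KEYWORDS_FIRST)
                    + ", ID first = " + classify(word, ID_FIRST));
        }
    }
}
```

With the keywords on top, "while" is classified as WHILE and "whilst" falls through to ID; with ID on top, the keyword rules are never reached, which is exactly the masking effect the question asks about.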
