
Matching of tokens with Antlr4

I am an Antlr4 newbie and have problems with a relatively simple grammar. The grammar is given at the end. (It is a fragment of a grammar for parsing descriptions of biological sequence variants.)

I am trying to parse the string "p.A3L" in the following unit test.

@Test
public void testProteinSubtitutionWithoutRef() {
    ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
    HGVSLexer l = new HGVSLexer(inputStream);
    HGVSParser p = new HGVSParser(new CommonTokenStream(l));
    p.setTrace(true);
    p.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
        }
    });
    p.hgvs();
}

The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume this is related to lexing, i.e. to splitting "A3L" into the three tokens A, 3, and L, so that the parser can then build the corresponding syntax subtree containing the three terminals.

What is going wrong here and where can I learn how to fix this?

The grammar

grammar HGVS;

hgvs: protein_var
    ;

// Basic lexemes

AA: AA1
  | AA3
  | 'X';

AA1: 'A'
   | 'R'
   | 'N'
   | 'D'
   | 'C'
   | 'Q'
   | 'E'
   | 'G'
   | 'H'
   | 'I'
   | 'L'
   | 'K'
   | 'M'
   | 'F'
   | 'P'
   | 'S'
   | 'T'
   | 'W'
   | 'Y'
   | 'V';

AA3: 'Ala'
   | 'Arg'
   | 'Asn'
   | 'Asp'
   | 'Cys'
   | 'Gln'
   | 'Glu'
   | 'Gly'
   | 'His'
   | 'Ile'
   | 'Leu'
   | 'Lys'
   | 'Met'
   | 'Phe'
   | 'Pro'
   | 'Ser'
   | 'Thr'
   | 'Trp'
   | 'Tyr'
   | 'Val';

NUMBER: [0-9]+;

NAME: [a-zA-Z0-9_]+;

// Top-level Rule

/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
           ;

There are two problems:

  • Define the rule for protein_var ahead of the lexer rules (it should work as it is now, too, but it is not easy to read because the other parser rule comes before the lexer rules).
  • Remove the rule for NAME. A3L is not lexed as AA NUMBER AA (as you probably expected) but as a single NAME token: ANTLR always prefers the lexer rule that matches the longest input, and when several rules match the same length, the one defined first wins.

The resulting grammar should look like:

grammar HGVS;

hgvs
    : protein_var
    ;

protein_var
    : 'p.' AA NUMBER AA
    ;

AA: ...;

AA3: ...;

AA1: ...;

NUMBER: [0-9]+;

If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAME and AA tokens do not have in common, or by using lexer modes).
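
To see why removing NAME changes the result, here is a minimal sketch in plain Java. It is not ANTLR and does not use the generated HGVSLexer; it only imitates the rule-selection policy described above (longest match wins, ties go to the rule listed first) with java.util.regex. The class name LongestMatchDemo, the helper methods, and the rule name P_DOT (standing in for the implicit 'p.' token) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative simulation only: this is NOT ANTLR, just the same selection policy.
class LongestMatchDemo {

    /** Builds the rule table in definition order; NAME is optional. */
    static Map<String, Pattern> rules(boolean withName) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("P_DOT", Pattern.compile("p\\."));
        // Three-letter codes come first because Java regex alternation is ordered.
        rules.put("AA", Pattern.compile(
                "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
                + "|[ARNDCQEGHILKMFPSTWYV]|X"));
        rules.put("NUMBER", Pattern.compile("[0-9]+"));
        if (withName) {
            rules.put("NAME", Pattern.compile("[a-zA-Z0-9_]+"));
        }
        return rules;
    }

    /** Splits the input into tokens: longest match wins, ties go to the earlier rule. */
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors at the region start; a strictly longer
                // match is required, so an earlier rule keeps a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            tokens.add(bestRule + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("p.A3L", rules(true)));   // [P_DOT(p.), NAME(A3L)]
        System.out.println(tokenize("p.A3L", rules(false)));  // [P_DOT(p.), AA(A), NUMBER(3), AA(L)]
    }
}
```

With NAME in the table, the three characters A3L are consumed as one token, which is the situation behind the "mismatched input 'A3L' expecting AA" message. Without it, the same input yields the AA NUMBER AA sequence that protein_var expects.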
