Matching of tokens with Antlr4
I am an Antlr4 newbie and have problems with a relatively simple grammar. The grammar is given at the end. (This is a fragment from a grammar for parsing descriptions of biological sequence variants.)
I am trying to parse the string "p.A3L" in the following unit test.
@Test
public void testProteinSubtitutionWithoutRef() {
    ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
    HGVSLexer l = new HGVSLexer(inputStream);
    HGVSParser p = new HGVSParser(new CommonTokenStream(l));
    p.setTrace(true);
    // Turn every syntax error into an exception so that the test fails.
    p.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
        }
    });
    p.hgvs();
}
The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then build the corresponding syntax subtree containing the three terminals.
What is going wrong here and where can I learn how to fix this?
grammar HGVS;
hgvs: protein_var
;
// Basic lexemes
AA: AA1
| AA3
| 'X';
AA1: 'A'
| 'R'
| 'N'
| 'D'
| 'C'
| 'Q'
| 'E'
| 'G'
| 'H'
| 'I'
| 'L'
| 'K'
| 'M'
| 'F'
| 'P'
| 'S'
| 'T'
| 'W'
| 'Y'
| 'V';
AA3: 'Ala'
| 'Arg'
| 'Asn'
| 'Asp'
| 'Cys'
| 'Gln'
| 'Glu'
| 'Gly'
| 'His'
| 'Ile'
| 'Leu'
| 'Lys'
| 'Met'
| 'Phe'
| 'Pro'
| 'Ser'
| 'Thr'
| 'Trp'
| 'Tyr'
| 'Val';
NUMBER: [0-9]+;
NAME: [a-zA-Z0-9_]+;
// Top-level Rule
/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
;
There are two problems:
1. Define the rule for protein_var ahead of the lexer rules (it works as written too, but is not easy to read because the other parser rule comes first).
2. Remove the rule for NAME. A3L is not lexed (as you probably expected) as AA NUMBER AA but as a single NAME token, because ANTLR always prefers the lexer rule that matches the longest input.
The resulting grammar should look like:
grammar HGVS;
hgvs
: protein_var
;
protein_var
: 'p.' AA NUMBER AA
;
AA: ...;
AA3: ...;
AA1: ...;
NUMBER: [0-9]+;
If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAME and AA tokens do not have in common, or by using lexer modes).
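The longest-match behaviour described in the answer can be illustrated without ANTLR at all. The following is a minimal sketch in plain Java, not ANTLR's actual implementation: the class `MaximalMunchDemo`, its helper `rules`, and the token name `P_DOT` (standing in for the implicit `'p.'` literal) are hypothetical, and AA is simplified to the one-letter codes only. At each input position it picks the rule with the longest match and breaks ties in favour of the rule listed first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {

    /** Builds a simplified rule set in declaration order (hypothetical helper). */
    static Map<String, Pattern> rules(boolean withName) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("P_DOT", Pattern.compile("p\\."));                // stands in for the literal 'p.'
        rules.put("AA", Pattern.compile("[ARNDCQEGHILKMFPSTWYVX]")); // one-letter codes only
        rules.put("NUMBER", Pattern.compile("[0-9]+"));
        if (withName) {
            rules.put("NAME", Pattern.compile("[a-zA-Z0-9_]+"));
        }
        return rules;
    }

    /** At each position the longest match wins; ties go to the rule listed first. */
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                // The strict '>' keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            tokens.add(bestRule + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With NAME present, "A3L" is swallowed as one token.
        System.out.println(tokenize("p.A3L", rules(true)));  // [P_DOT(p.), NAME(A3L)]
        // Without NAME, the input splits into the three expected tokens.
        System.out.println(tokenize("p.A3L", rules(false))); // [P_DOT(p.), AA(A), NUMBER(3), AA(L)]
    }
}
```

The first result mirrors the error message in the question: the token at column 2 is 'A3L', where the parser expected AA.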