
Antlr4 parser fails - need backtracking?

I'm developing a grammar for a given language. I believe the grammar I've come up with should work, but Antlr4 disagrees. Judging by the errors, it looks like backtracking is missing. But Antlr4 is supposed to parse without it...

Each of the examples should have exactly one solution. There are ambiguities during parsing, but all options except one should turn out to be dead ends. So I expect the parser to go back and try the next possible alternative. Instead, it just reports a syntax error.

Quick summary of the grammar: there are elements separated by '#'. After an element there can be an optional jump, which is indicated by a single '='. If the element itself contains a '#' or '=', these are escaped by doubling them. To avoid ambiguity, an element is not allowed to end with '#'. So a '###' is always the separator first, followed by the escaped first character of the next element. A '####' is not a separator, just two escaped '#' inside a name.
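To make the '#' rule concrete, here is a minimal hand-written sketch in plain Java (no Antlr involved). The class name HashRunSplitter and its split method are made up for this illustration, and the sketch only covers the hash rule: it does not treat jumps specially, so a trailing "#d=dest" would simply come out as one more piece. It relies on the observation that a run with an odd number of '#' starts with a separator, while an even run consists only of escaped pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the grammar or the generated parser.
final class HashRunSplitter {

    // Splits src into elements, keeping the leading separator '#' on each element.
    static List<String> split(String src) {
        List<String> elements = new ArrayList<>();
        StringBuilder current = null;
        int i = 0;
        while (i < src.length()) {
            if (src.charAt(i) == '#') {
                // Measure the whole run of consecutive '#'.
                int start = i;
                while (i < src.length() && src.charAt(i) == '#') {
                    i++;
                }
                int run = i - start;
                if (run % 2 == 1) {
                    // Odd run: an element may not end with '#',
                    // so the first '#' must be the separator.
                    if (current != null) {
                        elements.add(current.toString());
                    }
                    current = new StringBuilder("#");
                    run--;
                }
                if (current == null) {
                    current = new StringBuilder();
                }
                // Whatever is left of the run is made of escaped "##" pairs.
                for (int k = 0; k < run; k++) {
                    current.append('#');
                }
            } else {
                if (current == null) {
                    current = new StringBuilder();
                }
                current.append(src.charAt(i));
                i++;
            }
        }
        if (current != null) {
            elements.add(current.toString());
        }
        return elements;
    }
}
```

For example, split("#BU/ConfigPath###sub") yields the two pieces "#BU/ConfigPath" and "###sub", while "#a####b" stays a single element.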

The grammar:

grammar ConfigPath;
configpath: toplevelement subprojectelement* EOF;
subprojectelement:  '#' path jump?;
toplevelement:      '#' path jump?;
jump:   jumpcommand '=' jumpdestination;
jumpcommand: '#d' | '#devpath';
jumpdestination: NONHASHCHAR+;              
path: pathelement ( '/' pathelement)*;             
pathelement: escapedCharacterHash* escapedCharacter ;
escapedCharacterHash: escapedCharacter | '##';
escapedCharacter: NONHASHCHAR | '==';
NONHASHCHAR: ~('#' | '/' | '=' );
HASH: '#';
EQ: '=';

The tests, with parser errors as comments:

@Test
public void testTripleHash() throws Exception {
    ConfigpathContext c = parse("#BU/ConfigPath###sub"); 
    // line 1:16 extraneous input '#' expecting {'##', '==', NONHASHCHAR}

    Assert.assertEquals( "#BU/ConfigPath", c.toplevelement().getText() );
    Assert.assertEquals( "##sub", c.subprojectelement().get(0).path().getText() );
}

Since a pathelement cannot end with a hash, the first hash of the triple should close the toplevelement and start the subprojectelement, whose path then begins with an escaped '##'.

@Test
public void testDoubleHash() throws Exception {
    ConfigpathContext c = parse("#BU/proj##bla#d==u##bla");
    // line 1:15 mismatched input '==' expecting '='

    Assert.assertEquals( "#BU/proj##bla", c.toplevelement().getText() );
    Assert.assertEquals( "#d==u##bla", c.subprojectelement().get(0).getText() );
}

@Test
public void testJumps() throws Exception {
    ConfigpathContext c = parse("#BU/pro##dla#du##d==la#d=dest");
    // line 1:14 missing '=' at 'u'

    Assert.assertEquals( "#BU/pro##dla", c.toplevelement().getText() );
    Assert.assertEquals( 1, c.subprojectelement().size());
    Assert.assertEquals( "du##d==la", c.subprojectelement().get(0).path().getText() );
    Assert.assertEquals( "dest", c.subprojectelement().get(0).jump().jumpdestination().getText() );
}


private ConfigpathContext parse(String src) {
    ConfigPathParser parser = new ConfigPathParser(new CommonTokenStream(new ConfigPathLexer(new ANTLRInputStream(src))));
    parser.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
            throw new RuntimeException("line " + line + ":" + charPositionInLine + " " + msg );
        }
    });
    return parser.configpath();
}

Is there any way to change the grammar so that it accepts the tests? Or is Antlr4 just not able to parse such a grammar? Would Antlr3 with backtracking find the solutions?

The grammar was wrong - thanks to cantSleepNow for pointing that out.

While I haven't understood every detail of the problem, it seems to be related to ambiguities in the lexer. The parser is able to resolve ambiguities through its alternative to backtracking (adaptive lookahead), but the lexer can't: it splits the input into tokens before the parser runs, without knowing which token the parser would need at that point.
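As far as I can tell, the effect can be reproduced without Antlr. The sketch below is a plain-Java imitation of longest-match tokenizing over the string literals of the first grammar ('#devpath', '#d', '##', '==', '#', '=', '/'), with every other character treated as a one-character token. The class MaximalMunchDemo and its tokenize method are invented for this illustration and are not Antlr API. It only mimics the "longest match wins" rule, which is enough to show where the reported errors come from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it imitates longest-match lexing, it is not the Antlr lexer.
final class MaximalMunchDemo {

    // String literals used in the first grammar.
    private static final String[] LITERALS = {"#devpath", "#d", "##", "==", "#", "=", "/"};

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            // Fallback: a single character (stands in for NONHASHCHAR).
            String best = src.substring(pos, pos + 1);
            for (String literal : LITERALS) {
                // Prefer the longest literal that matches at this position.
                if (literal.length() > best.length() && src.startsWith(literal, pos)) {
                    best = literal;
                }
            }
            tokens.add(best);
            pos += best.length();
        }
        return tokens;
    }
}
```

With this, tokenize("###sub") starts with '##' followed by '#', although the parser would need '#' followed by '##'. Likewise tokenize("a#d==u") produces '#d' and '==', so the parser sees a jump command followed by '==' instead of a separator, a 'd' and an escaped '='. This lines up with the error positions in the test comments.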

So here is the working grammar:

grammar ConfigPath;

configpath: toplevelement subprojectelement* EOF;

subprojectelement:  '#' path jump?;

toplevelement:      '#' path jump?;

jump:   jumpcommand '=' jumpdestination;

jumpdestination : string;

jumpcommand: HASH D 'devpath'?;

path: pathelement ( '/' pathelement)*;             
pathelement: escapedCharacterHash* escapedCharacter ;

escapedCharacterHash: escapedCharacter | HASH HASH;
escapedCharacter: string | EQ EQ;
string  : (NONHASHCHAR | D)+;
NONHASHCHAR: ~('#' | '/' | '=' | 'd' );
D: 'd';
HASH: '#';
EQ: '=';
