
Antlr4 Match anything (including multiple lines) between tokens

I want to parse markdown code blocks but I can't seem to get the rule right so it matches multiple lines correctly.

Here is my grammar (code.g4):

grammar code;

file: code+;
code: '```' CODE '```';

CODE: [a-z]+;
EOL: '\r'? '\n' -> skip;

And here is my input (code.txt):

```
foo
foo
```

```
bar
bar
```

```
baz
baz
```

When I run java org.antlr.v4.gui.TestRig code file -tree code.txt, I get:

line 3:0 extraneous input 'foo' expecting '```'
line 8:0 extraneous input 'bar' expecting '```'
line 13:0 extraneous input 'baz' expecting '```'
(file (code ``` foo foo ```) (code ``` bar bar ```) (code ``` baz baz ```))

I want it to match the whole code block as one token so I can parse it as one stream of bytes. What am I missing in my grammar?

(I'm using ANTLR 4.10.1 and openjdk version "11.0.15" 2022-04-19.)

You've defined just a single CODE token between the back-ticks. You need one or more CODE tokens:

code: '```' CODE+ '```';


That said, parsing Markdown with a tool like ANTLR (where there is a strict separation between lexer and parser rules) is going to be really hard. See: https://github.com/antlr/grammars-v4/issues/472

Your CODE Lexer rule only matches lowercase 'a'-'z' characters. It's not going to match any other characters or whitespace (including new lines and carriage returns).

That said, just correcting that lexer rule is not going to solve your problem. You'll need to look into lexer modes: when you encounter ``` in the outer mode, you switch to an inner mode that matches anything (non-greedily, `.*?`) up to the closing ```, at which point you pop the lexer mode.

(I'm pretty sure the ```'s need to be at the start of a line, so you'll also need a predicate to only match them if they're at the start of a line.)
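
A minimal sketch of the lexer-mode approach described above, assuming the Java target and a split lexer/parser grammar (ANTLR only accepts `mode` in a lexer grammar, not in a combined one). The names `CodeLexer`, `CodeParser`, `IN_CODE`, `OPEN`, and `BODY` are made up for illustration, and the start-of-line predicate relies on the Java runtime's `getCharPositionInLine()`; other targets would need their own equivalent.

```antlr
// CodeLexer.g4 (hypothetical file name)
lexer grammar CodeLexer;

// Opening fence: only recognized in column 0 (Java-target predicate),
// then switch to the inner mode.
OPEN : {getCharPositionInLine() == 0}? '```' -> pushMode(IN_CODE);
WS   : [ \t\r\n]+ -> skip;

mode IN_CODE;

// Non-greedy match: everything up to and including the first closing
// fence becomes a single token, then return to the outer mode.
BODY : .*? '```' -> popMode;

// CodeParser.g4 (hypothetical file name)
parser grammar CodeParser;
options { tokenVocab = CodeLexer; }

file : code+ EOF;
code : OPEN BODY;
```

In this sketch each `BODY` token still carries the leading newline and the closing fence in its text, so those would need trimming in a listener or visitor. The closing fence is not checked for being at the start of a line, and any text outside the code blocks other than whitespace would produce token recognition errors, so a real Markdown grammar would need more rules in the outer mode.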
