Antlr4: match anything between markers (including multiple lines)

Question

I want to parse Markdown code blocks, but I can't get the rule right so that it matches multiple lines correctly.

Here is my grammar (code.g4):

grammar code;

file: code+;
code: '```' CODE '```';

CODE: [a-z]+;
EOL: '\r'? '\n' -> skip;

And here is my input (code.txt):

```
foo
foo
```

```
bar
bar
```

```
baz
baz
```

When I run java org.antlr.v4.gui.TestRig code file -tree code.txt, I get:

line 3:0 extraneous input 'foo' expecting '```'
line 8:0 extraneous input 'bar' expecting '```'
line 13:0 extraneous input 'baz' expecting '```'
(file (code ``` foo foo ```) (code ``` bar bar ```) (code ``` baz baz ```))

I want it to match the whole code block as one token so I can parse it as one stream of bytes. What am I missing in my grammar?

(I'm using ANTLR 4.10.1 and openjdk version "11.0.15" 2022-04-19.)

Answer 1

You've defined just a single CODE token between the back-ticks. You need one or more CODE tokens:

code: '```' CODE+ '```';

That said, parsing Markdown with a tool like ANTLR (where there is a strict separation between lexer and parser rules) is going to be really hard. See: https://github.com/antlr/grammars-v4/issues/472

Answer 2

Your CODE lexer rule only matches lowercase 'a'-'z' characters. It's not going to match any other characters or whitespace (including newlines and carriage returns).

That said, just correcting that lexer rule is not going to solve your problem. You'll need to look into lexer modes: when you encounter ``` in the outer mode, you can switch to the inner mode and match anything (non-greedily, `.*?`) followed by ```, at which point you pop your lexer mode.
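A minimal sketch of the lexer-mode idea described above. The names CodeLexer, FENCE_OPEN, WS, IN_CODE and CODE_BODY are hypothetical and do not come from the original grammar. Because `mode` sections are only allowed in lexer grammars, this is written as a standalone lexer grammar rather than as part of the combined grammar from the question:

```antlr
lexer grammar CodeLexer;

// Outer (default) mode: an opening fence switches to the inner mode.
FENCE_OPEN : '```' -> pushMode(IN_CODE) ;

// Whitespace between code blocks is dropped.
WS         : [ \t\r\n]+ -> skip ;

mode IN_CODE;

// Inner mode: consume as little as possible up to the next fence, so the
// whole body (newlines included) plus the closing fence is a single token;
// then return to the outer mode.
CODE_BODY  : .*? '```' -> popMode ;
```

A matching parser grammar would presumably set options { tokenVocab = CodeLexer; } and use a rule such as code: FENCE_OPEN CODE_BODY;. Note that in this sketch the text of CODE_BODY includes the closing fence, so it would have to be trimmed afterwards, and non-whitespace text outside a code block is not handled at all.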

(I'm pretty sure the ```'s need to be at the start of a line, so you'll also need a predicate to only match them if they're at the start of a line.)
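One way such a start-of-line check might look, assuming the Java target (the predicate body is target-language code) and hypothetical rule names FENCE and OTHER. This is only a sketch of the predicate itself, not a complete Markdown lexer:

```antlr
lexer grammar FenceLexer;

// Java target: the predicate sits on the left edge of the rule, so it is
// checked at the token's first character; column 0 means start of line.
FENCE : {getCharPositionInLine() == 0}? '```' ;

// Fallback so that any other input is tokenized one character at a time.
OTHER : . ;
```

With this sketch, a ``` that appears in the middle of a line fails the predicate and falls through to OTHER, one back-tick at a time.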

Solution 1
2, accepted, 2022-05-04 16:32:11

Solution 2
0, 2022-05-04 15:37:27
