How to specify a unicode literal that requires more than four hex digits in Antlr?

Question

I want to define a lexer rule for ranges between Unicode characters whose code points need more than four hexadecimal digits to identify. To be concrete, I want to declare the following rule:

ID_Continue : [\uE0100-\uE01EF] ;

Unfortunately, it doesn't work. This rule will match characters that are not in this range. (I'm not certain exactly what behaviour this results in, but it isn't the one I want.) I've also tried the following (padding with leading zeros and using 8 digits):

ID_Continue : [\U000E0100-\U000E01EF] ;

But it seems to result in the same unwanted behaviour.

I am using Antlr4 and its IntelliJ plugin for testing.

Does Antlr4 not support unicode literals above \uFFFF?

Answer 1

No, ANTLR's max is the same as Java's Character.MAX_VALUE.

If you look at (a part of) ANTLR4's lexer grammar, you will see these rules:

// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
    :   Esc
        ( [btnfr"'\\]   // The standard escaped character set such as tab, newline, etc.
        | UnicodeEsc    // A Unicode escape sequence
        | .             // Invalid escape character
        | EOF           // Incomplete at EOF
        )
    ;

...

fragment UnicodeEsc
    :   'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
    ;

...

fragment Esc : '\\' ;
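
To make the limit concrete, here is a minimal Java sketch. It is not part of ANTLR, and the class and method names are made up for illustration. It shows two things: an escape reader that takes at most four hex digits (like the UnicodeEsc fragment above) sees the text E0100 as the code unit U+E010 followed by a literal '0', and the code points U+E0100 to U+E01EF all share one high surrogate in UTF-16.

```java
public class SupplementaryRange {

    // Encode one code point as UTF-16 and render every code unit
    // as a four-digit hex escape.
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // Hypothetical helper: mimic an escape reader that consumes at most
    // four hex digits after the 'u' and ignores whatever follows.
    static int fourDigitEscapeValue(String hexDigits) {
        int end = Math.min(4, hexDigits.length());
        return Integer.parseInt(hexDigits.substring(0, end), 16);
    }

    public static void main(String[] args) {
        // The largest value a single Java char can hold.
        System.out.println((int) Character.MAX_VALUE == 0xFFFF);   // true

        // U+E0100 lies above that, so it needs two chars.
        System.out.println(Character.charCount(0xE0100));          // 2

        // Only the first four digits of "E0100" form the escape.
        System.out.println(String.format("%04X", fourDigitEscapeValue("E0100"))); // E010

        // Both ends of the wanted range as surrogate pairs.
        System.out.println(utf16Escapes(0xE0100));
        System.out.println(utf16Escapes(0xE01EF));
    }
}
```

Assuming the lexer works on UTF-16 code units (which I believe is the case for the Java target here, but I have not tested it), the two surrogate pairs suggest writing the rule as a fixed high surrogate followed by a low-surrogate range, something like ID_Continue : '\uDB40' [\uDD00-\uDDEF] ;. Newer ANTLR releases (4.7 and later, if I remember correctly) also accept a braced escape such as \u{E0100}, so it is worth checking what your version supports first.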

Answer 2

Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance, my MySQL grammar, written for ANTLR3 (C target), can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.

What's a bit strange here, however, is that I haven't specified that range in the grammar (it uses only the BMP). Still, the parser can parse any UTF-8 input. Might be a bug in the target runtime, though I'm happy it exists :-D


Answer 1: score 2, accepted, 2016-03-11 11:46:33

Answer 2: score 0, 2016-03-12 10:21:44
