How do I specify a unicode literal that requires more than four hex digits in Antlr?

I want to define a lexer rule for ranges of Unicode characters whose code points need more than four hexadecimal digits. Concretely, I want to declare the following rule:

ID_Continue : [\uE0100-\uE01EF] ;

Unfortunately, it doesn't work: the rule matches characters that are not in this range. (I'm not sure exactly what behaviour it produces, but it isn't the one I want.) I also tried padding with leading zeros and using eight digits:

ID_Continue : [\U000E0100-\U000E01EF] ;

But it seems to result in the same unwanted behaviour.

I am using ANTLR4 and its IntelliJ plugin for testing.

Does ANTLR4 not support unicode literals above \uFFFF?

No, it does not: ANTLR's maximum is the same as Java's Character.MAX_VALUE.
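To make that ceiling concrete, here is a minimal Java sketch. The class and helper names are made up for illustration, and it only exercises Java's char type; it does not call ANTLR. It prints Character.MAX_VALUE and the UTF-16 code units Java uses for the two endpoints of the range in the question.

```java
public class SupplementaryCodePointSketch {

    // Hypothetical helper: renders a code point as the sequence of UTF-16
    // code units Java stores for it, each written as a four-digit hex escape.
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The largest value a single Java char can hold.
        System.out.printf("Character.MAX_VALUE = U+%04X%n", (int) Character.MAX_VALUE);
        // Both endpoints lie above that value, so each needs two chars.
        System.out.println("U+E0100 -> " + utf16Escapes(0xE0100));
        System.out.println("U+E01EF -> " + utf16Escapes(0xE01EF));
    }
}
```

This reports U+FFFF as the maximum and shows that U+E0100 and U+E01EF are each stored as a surrogate pair (\uDB40\uDD00 and \uDB40\uDDEF), so a single four-digit escape cannot name either of them.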

If you look at (a part of) ANTLR4's lexer grammar, you will see these rules:

// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
    :   Esc
        ( [btnfr"'\\]   // The standard escaped character set such as tab, newline, etc.
        | UnicodeEsc    // A Unicode escape sequence
        | .             // Invalid escape character
        | EOF           // Incomplete at EOF
        )
    ;

...

fragment UnicodeEsc
    :   'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
    ;

...

fragment Esc : '\\' ;
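As a rough illustration of what the UnicodeEsc rule implies for the rule in the question, the Java sketch below splits the body of a char set the same way, consuming at most four hex digits after the escape introducer. This is not ANTLR's actual implementation; the class and method names are hypothetical and it ignores every other kind of escape.

```java
import java.util.ArrayList;
import java.util.List;

public class CharSetEscapeSketch {

    // Hypothetical helper: splits the text between '[' and ']' into elements.
    // A backslash followed by 'u' takes at most four hex digits, mirroring
    // the UnicodeEsc fragment; every other character is its own element.
    static List<String> splitElements(String body) {
        List<String> elements = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            if (body.startsWith("\\u", i)) {
                int end = i + 2;
                // Stop after four hex digits even if more follow.
                while (end < body.length() && end < i + 6
                        && Character.digit(body.charAt(end), 16) >= 0) {
                    end++;
                }
                elements.add(body.substring(i, end));
                i = end;
            } else {
                elements.add(String.valueOf(body.charAt(i)));
                i++;
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        // The char set body from the question.
        System.out.println(splitElements("\\uE0100-\\uE01EF"));
    }
}
```

For the rule in the question this prints [\uE010, 0, -, \uE01E, F]. Under that reading the set would consist of U+E010, the range from '0' to U+E01E, and the letter F, which would account for the rule matching characters far outside the intended range.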

Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance, my MySQL grammar, written for ANTLR3 (C target), can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.


What's a bit strange here, however, is that I haven't specified that range in the grammar (it uses only the BMP). Still, the parser can parse any UTF-8 input. It might be a bug in the target runtime, though I'm happy it exists :-D
