
What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits, and uppercase and lowercase letters, where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to test a single character used in an identifier, what would it be called? Clearly, is_identifier is wrong, as that names the entire token the lexer scans rather than an individual character. I suppose is_alnum_or_underscore would be accurate, though quite verbose. For something as common as this, I feel like there should be a single word for it.

Unicode Standard Annex #31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.

UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.

That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):

<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
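
As a rough illustration, the pattern above can be transliterated into a Python regular expression. This is only a sketch under loud assumptions: the three sets here are tiny ASCII stand-ins rather than the real Unicode properties, and the hyphen in MEDIAL is purely hypothetical, chosen just to show how the third group behaves.

```python
import re

# Assumption: ASCII-only stand-ins for the UAX31 sets, for illustration only.
START = "A-Za-z_"
CONTINUE = "A-Za-z0-9_"
MEDIAL = r"\-"  # hypothetical Medial set containing only the hyphen

# <Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
IDENTIFIER = re.compile(
    rf"[{START}][{CONTINUE}]*(?:[{MEDIAL}][{CONTINUE}]+)*"
)


def is_identifier(text: str) -> bool:
    """True if the whole string matches the pattern above."""
    return IDENTIFIER.fullmatch(text) is not None
```

With these stand-in sets, a Medial character is accepted only between two runs of Continue characters, so "foo-bar" matches while "foo-" and "foo--bar" do not.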

A programming language that claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)

That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories, identifier-start and identifier-continue, where the first one is a subset of the second one.
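
Applied to the lexer helpers from the question, that naming might look like the sketch below. The function names is_identifier_start, is_identifier_continue and scan_identifier are my own illustrative choices, and I'm assuming Python's built-in str.isidentifier (which follows Python's own profile, including "_" as a start character) as a stand-in for the two sets; a real lexer would use whatever sets its language defines.

```python
def is_identifier_start(ch: str) -> bool:
    """True if the single character ch may begin an identifier."""
    # Assumption: Python's own identifier rules stand in for the Start set.
    return len(ch) == 1 and ch.isidentifier()


def is_identifier_continue(ch: str) -> bool:
    """True if the single character ch may follow the first character."""
    # Prefixing a known start character turns the whole-string check
    # into a test of ch as a continue character.
    return len(ch) == 1 and ("a" + ch).isidentifier()


def scan_identifier(source: str, pos: int) -> int:
    """Return the index just past the identifier starting at pos.

    Returns pos unchanged if no identifier starts there.
    """
    if pos >= len(source) or not is_identifier_start(source[pos]):
        return pos
    end = pos + 1
    while end < len(source) and is_identifier_continue(source[end]):
        end += 1
    return end


text = "count2 = 1"
end = scan_identifier(text, 0)
print(text[:end])  # count2
```

Note that a digit passes is_identifier_continue but not is_identifier_start, which is exactly the subset relationship described above.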

You'll see that in a number of grammar documents:

Python
    identifier ::= xid_start xid_continue*
Rust
    IDENTIFIER_OR_KEYWORD : XID_Start XID_Continue* | _ XID_Continue+
C++
    identifier:
        identifier-start
        identifier identifier-continue

So that's what I'd suggest. But there are many other possibilities:

Swift
    Calls the sets identifier-head and identifier-character.
Java
    Calls them JavaLetter and JavaLetterOrDigit.
C
    Defines identifier-nondigit and digit; Continue would be the union of the two sets.
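
For the C-style split, the union can be sketched with plain sets. This is a simplified ASCII-only illustration: the constant names are mine, and it ignores the universal character names and implementation-defined characters that C also allows in identifiers.

```python
import string

# Assumption: ASCII-only approximation of C's two character sets.
IDENTIFIER_NONDIGIT = frozenset(string.ascii_letters + "_")
DIGIT = frozenset(string.digits)

# Start is the nondigit set; Continue is the union of both sets.
IDENTIFIER_START = IDENTIFIER_NONDIGIT
IDENTIFIER_CONTINUE = IDENTIFIER_NONDIGIT | DIGIT
```

Here IDENTIFIER_START is a proper subset of IDENTIFIER_CONTINUE, matching the start/continue relationship used by the other grammars.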
