
Why does the C standard state that string literals shall begin and end in the initial shift state?

Original post: 2022-06-01 17:23:23 | 9 1 | c, string-literals, c89

The ANSI X3.159-1989 "Programming Language C" standard states in section "5.2.1.2 - Multibyte characters" that:

For the source character set, the following shall hold:

A comment, string literal, character constant, or header name shall begin and end in the initial shift state.

Does this mean that a string literal, etc. shall begin and end with a character represented by a value of the initial shift state, i.e. a single-byte value? Or does it mean that the environment shall reset its current shift state to the initial shift state before and after processing a given string literal, etc.?
Why so? That is, what is the purpose of setting the initial shift state, especially at the end of a string literal, etc.?

1 Answer

Why does the C standard state that string literals shall begin and end in the initial shift state?

Let's first see what exactly (or as exactly as the specification gets) is meant by "shift state":

A multibyte character may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other implementation-defined shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

Requiring string literals to begin and end in the initial shift state makes string semantics a lot simpler and more predictable. If you concatenate two strings, or output them one after the other, you can be confident that their juxtaposition does not change the meaning of the latter. If the first could terminate in a shift state different from the initial one, then that would not be guaranteed.
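To make the concatenation point concrete, here is a minimal sketch using a made-up encoding. Everything in it is hypothetical and chosen only for illustration: the toy encoding (byte 0x0E enters a "shifted" state, byte 0x0F returns to the initial state, and lowercase letters mean their uppercase forms while shifted), and the names `toy_decode` and `toy_decode_concat`. It is loosely modelled on shift-out/shift-in codes and is not any real locale's encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy encoding (assumes an ASCII-compatible character set):
 *   0x0E enters the "shifted" state, 0x0F returns to the initial state.
 *   In the initial state a byte stands for itself.
 *   In the shifted state a lowercase letter stands for its uppercase form. */
enum { TOY_SO = 0x0E, TOY_SI = 0x0F };
enum toy_state { TOY_INITIAL = 0, TOY_SHIFTED = 1 };

/* Decode `in`, starting in the initial shift state, into `out`
 * (which must hold at least strlen(in) + 1 bytes).
 * Returns the shift state in effect at the end of the input. */
static enum toy_state toy_decode(const char *in, char *out)
{
    enum toy_state state = TOY_INITIAL;
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; ++in) {
        unsigned char c = (unsigned char)*in;
        if (c == TOY_SO) {
            state = TOY_SHIFTED;            /* shift code: changes state, emits nothing */
        } else if (c == TOY_SI) {
            state = TOY_INITIAL;
        } else if (state == TOY_SHIFTED && c >= 'a' && c <= 'z') {
            *out++ = (char)(c - 'a' + 'A'); /* same byte, different meaning */
        } else {
            *out++ = (char)c;
        }
    }
    *out = '\0';
    return state;
}

/* Join two encoded strings byte for byte, with no knowledge of the
 * encoding, then decode the result. Inputs must total under 64 bytes. */
static enum toy_state toy_decode_concat(const char *a, const char *b, char *out)
{
    char joined[64];
    assert(a != NULL && b != NULL);
    assert(strlen(a) + strlen(b) < sizeof joined);
    strcpy(joined, a);
    strcat(joined, b);
    return toy_decode(joined, out);
}
```

With this sketch, joining `"ab\x0E" "cd\x0F"` and `"ef"` decodes to `abCDef`, because the first piece returns to the initial state before it ends. Drop the trailing 0x0F from the first piece and the same `"ef"` decodes as `EF`: the second string's meaning now depends on what precedes it, which is the situation the standard's rule excludes for literals.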

The inherent assumption underlying all this is that language-level semantics are ignorant of the details of any particular character encoding. They treat all strings as black boxes of bytes, terminated by a null character.

Does this mean that a string literal, etc. shall begin and end with a character represented by a value of the initial shift state, i.e. a single-byte value? Or does it mean that the environment shall reset its current shift state to the initial shift state before and after processing a given string literal, etc.?

Neither. With a state-dependent encoding, the current shift state is a running property of an interpretation of an encoded character sequence. Characters do not necessarily encode shift states directly, but the encoding scheme provides a way to specify shift state changes.

Details can vary with the particular encoding scheme, but encoded characters, whether single-byte or multibyte, are not generally in a particular shift state inherently. The whole point of such an encoding is that the same subsequence may be interpreted differently depending on the shift state. Thus, starting in the initial shift state is an assertion about how multibyte character sequences will be interpreted, and only by implication a statement about what a string literal must contain.

Ending in the initial shift state, on the other hand, is a constraint on the contents of the string, etc. A C source file is malformed if the bytes of a string literal, etc. within it, interpreted as starting in the initial shift state, encode one or more state shifts such that the shift state at the end of the byte sequence is different from the initial shift state. This is exactly to relieve the implementation from having to be concerned with encoding issues, and absolutely not to require it to perform any kind of shift-state cleanup.
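The standard library has a run-time notion that matches this constraint: `mbsinit` reports whether an `mbstate_t` describes the initial conversion state. The sketch below uses it together with `mbrtowc` to ask whether a byte string, read from the initial state in the current locale's encoding, also ends there. The function name `ends_in_initial_shift_state` is made up for illustration. Note the assumptions: this checks an execution-time string under the current locale, whereas the rule quoted above concerns source files at translation time, and the treatment of a trailing shift sequence that no character follows rests on my reading of `mbrtowc`'s `(size_t)-2` return, which I have not verified against a real state-dependent locale.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Returns 1 if `s`, interpreted in the current locale's multibyte encoding
 * starting from the initial shift state, also ends in the initial shift
 * state; returns 0 if it ends in another state, ends mid-character, or
 * contains an invalid sequence. */
static int ends_in_initial_shift_state(const char *s)
{
    mbstate_t st;
    size_t left;

    assert(s != NULL);
    memset(&st, 0, sizeof st);  /* a zero-valued mbstate_t describes the initial state */
    left = strlen(s);

    while (left > 0) {
        /* A null first argument means: decode, but do not store the character. */
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1)
            return 0;           /* invalid multibyte sequence */
        if (n == (size_t)-2)
            break;              /* all remaining bytes consumed without completing a
                                   character; st records what is pending */
        if (n == 0)
            break;              /* defensive: a null character was decoded */
        s += n;
        left -= n;
    }
    return mbsinit(&st) != 0;
}
```

In a stateless encoding, such as the default "C" locale or UTF-8, every complete, valid string passes this check, which is why the constraint only bites for state-dependent encodings.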

Why so? That is, what is the purpose of setting the initial shift state, especially at the end of a string literal, etc.?

It simplifies the language and improves the maintainability of C source files written in state-dependent source encodings. Each unit that accepts user-defined free(ish) text is modular: it has the same, well-defined meaning regardless of the surrounding context, and moving, copying, or deleting such units cannot change the lexical interpretation of the surrounding tokens.