
C standard: Character set and string encoding specification

I found the C standard (C99 and C11) vague with respect to character/string code positions and encoding rules:

Firstly, the standard defines the source character set and the execution character set. Essentially it provides a set of glyphs, but does not associate any numerical values with them. So what is the default character set?

I'm not asking about encoding here, just the mapping from glyph/repertoire to numeric value/code point. The standard does define universal character names in terms of ISO/IEC 10646, but does it say that this is the default charset?

As an extension to the above: I couldn't find anything that says which characters the numeric escape sequences \0 and \x represent.

From the C standards (C99 and C11, I didn't check ANSI C) I got the following about character and string literals:

 +---------+-----+------------+----------------------------------------------+
 | Literal | Std | Type       | Meaning                                      |
 +---------+-----+------------+----------------------------------------------+
 | '...'   | C99 | int        | An integer character constant is a  sequence |
 |         |     |            | of one or more multibyte characters          |
 | L'...'  | C99 | wchar_t    | A wide character constant is a sequence of   |
 |         |     |            | one or more multibyte characters             |
 | u'...'  | C11 | char16_t   | A wide character constant is a sequence of   |
 |         |     |            | one or more multibyte characters             |
 | U'...'  | C11 | char32_t   | A wide character constant is a sequence of   |
 |         |     |            | one or more multibyte characters             |
 | "..."   | C99 | char[]     | A character string literal is a sequence of  |
 |         |     |            | zero or more multibyte characters            |   
 | L"..."  | C99 | wchar_t[]  | A wide string literal is a sequence of zero  |
 |         |     |            | or more multibyte characters                 | 
 | u8"..." | C11 | char[]     | A UTF-8 string literal is a sequence of zero |
 |         |     |            | or more multibyte characters                 | 
 | u"..."  | C11 | char16_t[] | A wide string literal is a sequence of zero  |
 |         |     |            | or more multibyte characters                 | 
 | U"..."  | C11 | char32_t[] | A wide string literal is a sequence of zero  |
 |         |     |            | or more multibyte characters                 | 
 +---------+-----+------------+----------------------------------------------+

However, I couldn't find anything about the encoding rules for these literals. The name "UTF-8 string literal" does seem to hint at UTF-8 encoding, but I don't think it's explicitly mentioned anywhere. Also, for the other types, is the encoding undefined or implementation-dependent?

I'm not too familiar with the UNIX specification. Does it specify any additional constraints on these rules?

Also, if anyone can tell me what charset/encoding scheme is used by GCC and MSVC, that would help too.

C is not greedy about character sets. There is no such thing as a "default character set"; it is implementation-defined, although on most modern systems it is ASCII or UTF-8.

The standard doesn't specify a default encoding because existing practice already had C implemented on machines with lots of different encodings, for example Honeywell mainframes and IBM mainframes.

I would expect gcc to take its default from the locale currently specified by LC_CHARSET, but I've never tested it.

VC++ takes its default from a Control Panel setting. That default varies according to the country in which Windows was purchased, and most users never change it, but they can change it while installing Windows or later.

Trigraphs were invented so that a source program could be copied from an environment with one locale to an environment with a slightly different locale and still be compiled. For example, if a Windows user in China uses trigraphs, then a Windows user in Greece would be able to compile the same source program. However, if the locales differ too much, for example one using EBCDIC and one using EUC, trigraphs won't suffice.


 