
C standard: character set and string encoding specification

I found the C standard (C99 and C11) vague with respect to character/string code positions and encoding rules:

Firstly, the standard defines the source character set and the execution character set. Essentially it provides a set of glyphs but does not associate any numerical values with them, so what is the default character set?

I'm not asking about encoding here but just the glyph/repertoire to numeric/code point mapping. It does define universal character names as ISO/IEC 10646, but does it say that this is the default charset?

As an extension to the above, I couldn't find anything that says what characters the numeric escape sequences \0 and \x represent.

From the C standards (C99 and C11, I didn't check ANSI C) I got the following about character and string literals:

 +---------+-----+------------+----------------------------------------------+
 | Literal | Std | Type       | Meaning                                      |
 +---------+-----+------------+----------------------------------------------+
 | '...'   | C99 | int        | An integer character constant is a sequence  |
 |         |     |            | of one or more multibyte characters          |
 | L'...'  | C99 | wchar_t    | A wide character constant is a sequence of   |
 |         |     |            | one or more multibyte characters             |
 | u'...'  | C11 | char16_t   | A wide character constant is a sequence of   |
 |         |     |            | one or more multibyte characters             |
 | U'...'  | C11 | char32_t   | A wide character constant is a sequence of   |
 |         |     |            | one or more multibyte characters             |
 | "..."   | C99 | char[]     | A character string literal is a sequence of  |
 |         |     |            | zero or more multibyte characters            |   
 | L"..."  | C99 | wchar_t[]  | A wide string literal is a sequence of zero  |
 |         |     |            | or more multibyte characters                 | 
 | u8"..." | C11 | char[]     | A UTF-8 string literal is a sequence of zero |
 |         |     |            | or more multibyte characters                 | 
 | u"..."  | C11 | char16_t[] | A wide string literal is a sequence of zero  |
 |         |     |            | or more multibyte characters                 | 
 | U"..."  | C11 | char32_t[] | A wide string literal is a sequence of zero  |
 |         |     |            | or more multibyte characters                 | 
 +---------+-----+------------+----------------------------------------------+

However, I couldn't find anything about the encoding rules for these literals. The u8 prefix does seem to hint at UTF-8 encoding, but I don't think it's explicitly mentioned anywhere. Also, for the other types, is the encoding undefined or implementation-defined?

I'm not too familiar with the UNIX specification. Does the UNIX specification add any constraints to these rules?

Also if anyone can tell me what charset/encoding scheme is used by GCC and MSVC that would also help.

C is not greedy about character sets. There's no such thing as a "default character set"; it's implementation-defined, although in practice it is ASCII or UTF-8 on most modern systems.

The standard doesn't specify a default encoding because, by the time it was written, C had already been implemented on machines with many different encodings, for example Honeywell mainframes and IBM mainframes (EBCDIC).

I would expect gcc to take its default from the current locale (the relevant category is LC_CTYPE; there is no LC_CHARSET), but I've never tested it.

VC++ takes its default from a Control Panel setting. The default for that setting varies according to the country in which Windows was purchased. Most users never change it, but they can change it while installing Windows or later.

Trigraphs were invented so that a source program could be copied from an environment with one locale to an environment with a slightly different locale and still be compiled. For example if a Windows user in China uses trigraphs then a Windows user in Greece would be able to compile the same source program. However, if the locales differ too much, for example one using EBCDIC and one using EUC, trigraphs won't suffice.
