
wcstombs: character encoding?

The wcstombs documentation says it "converts the sequence of wide-character codes to multibyte string". But it never says what a "wide character" is.

Is it implicit (say, it converts UTF-16 to UTF-8), or is the conversion defined by some environment variable?

Also, what is the typical use case for wcstombs?

You use the standard setlocale() function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation-defined, so you'll need to look it up in your compiler's docs.

For example, with MSVC you might use

setlocale( LC_ALL, ".1252" );

to set the C runtime to use code page 1252 as the multibyte character set. Newer versions of the Universal C Runtime (Windows 10 version 1803 and later) do accept a UTF-8 code page, but the MSVC docs at the time of writing explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character set:

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.
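Because setlocale() returns NULL when it cannot honor a request, it is worth checking the result rather than assuming the locale took effect. Here is a minimal sketch, assuming a hypothetical helper named set_first_available_ctype and caller-supplied locale names (which names are valid remains implementation-defined):

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate locale name in order and return
   the first one that setlocale() accepts for LC_CTYPE, or NULL if none of
   them is accepted. */
const char *set_first_available_ctype(const char *const names[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* setlocale() returns NULL if the request cannot be honored. */
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}
```

You might call it with something like { ".1252", "en_US.UTF-8", "C" }; which entry wins depends entirely on the platform and on the locales installed there.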

The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of the wchar_t type (it could be as small as a char or as large as any of the larger integer types). On Windows it's the system's "internal" Unicode encoding, which is UTF-16 (UCS-2 before Windows 2000). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.

It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS-2 on Windows, but is usually UCS-4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used - but note that there are other possibilities, like JIS.
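To make that concrete, here is a minimal sketch of a wrapper around wcstombs. The name wide_to_multibyte is made up for illustration, and the output encoding is whatever the current LC_CTYPE locale says it is:

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated multibyte
   string in the encoding of the current LC_CTYPE locale. Returns NULL if
   allocation fails or if some wide character has no representation in that
   locale. The caller frees the result. */
char *wide_to_multibyte(const wchar_t *src)
{
    /* Each wide character needs at most MB_CUR_MAX bytes in this locale,
       plus one byte for the terminating null. */
    size_t cap = wcslen(src) * MB_CUR_MAX + 1;
    char *dst = malloc(cap);
    if (dst == NULL)
        return NULL;

    size_t written = wcstombs(dst, src, cap);
    if (written == (size_t)-1) {   /* unconvertible character */
        free(dst);
        return NULL;
    }
    dst[written] = '\0';           /* written is always less than cap here */
    return dst;
}
```

A program starts in the "C" locale, so call setlocale(LC_CTYPE, "") (or pass an explicit name) first if you want the user's encoding; otherwise anything outside the basic character set is likely to fail to convert.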

Wide character strings are sequences of wchar_t elements, each of which is typically more than one byte wide, whereas the normal C string is a char* - a sequence of byte-wide characters. Wchars are not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.

I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of Unicode.

Typical usage would be converting a string of 2-byte characters to a regular C string with wcstombs, and vice versa with mbstowcs.
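For the reverse direction, a rough sketch using mbstowcs might look like this. The function name multibyte_to_wide is invented for the example, and the input is interpreted in the current LC_CTYPE locale:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the encoding of the
   current LC_CTYPE locale) to a newly allocated wide string. Returns NULL
   if allocation fails or the input contains an invalid multibyte sequence.
   The caller frees the result. */
wchar_t *multibyte_to_wide(const char *src)
{
    /* Every wide character consumes at least one input byte, so
       strlen(src) + 1 elements is always enough room. */
    size_t cap = strlen(src) + 1;
    wchar_t *dst = malloc(cap * sizeof *dst);
    if (dst == NULL)
        return NULL;

    size_t converted = mbstowcs(dst, src, cap);
    if (converted == (size_t)-1) { /* invalid multibyte sequence */
        free(dst);
        return NULL;
    }
    dst[converted] = L'\0';        /* converted is always less than cap here */
    return dst;
}
```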

According to the C standard, the wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the minimum limits it requires for WCHAR_MIN and WCHAR_MAX are only [0, 255] or [-127, 127], depending upon whether wchar_t is unsigned or signed.
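If you want to see what your own implementation chose, a small sketch like the following prints the relevant properties. The helper names are made up, and the output differs between platforms:

```c
#include <stdint.h>  /* WCHAR_MIN and WCHAR_MAX (also provided by <wchar.h>) */
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if wchar_t is a signed type on this implementation, else 0. */
int wchar_is_signed(void)
{
    return (wchar_t)-1 < 0;
}

/* Print the size, signedness and upper limit of wchar_t. */
void print_wchar_properties(void)
{
    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    printf("wchar_t is %s\n", wchar_is_signed() ? "signed" : "unsigned");
    printf("WCHAR_MAX = %llu\n", (unsigned long long)WCHAR_MAX);
}
```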

A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, the characters need not all occupy the same number of bytes (UTF-8 is an example). An object of type wchar_t, by contrast, has a fixed size (in a given implementation, of course).
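One way to see the difference between bytes and characters is to walk a multibyte string with mblen. This is only a sketch with a made-up function name, and the count it returns depends on the current LC_CTYPE locale:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the multibyte characters in s according to the
   current LC_CTYPE locale. Returns (size_t)-1 if an invalid sequence is
   found. */
size_t count_multibyte_chars(const char *s)
{
    size_t remaining = strlen(s);
    size_t count = 0;

    (void)mblen(NULL, 0);          /* reset mblen's conversion state */
    while (remaining > 0) {
        int len = mblen(s, remaining);
        if (len < 0)
            return (size_t)-1;     /* invalid or incomplete sequence */
        if (len == 0)
            break;                 /* reached a null character */
        s += len;
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

In a single-byte locale the result equals strlen(s); in a UTF-8 locale it will be smaller whenever the string contains characters encoded in several bytes.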

As an aside, I can also find the following in my copy of the C99 draft:

__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.

So, if I understand correctly, if __STDC_ISO_10646__ is defined, then a wchar_t holds the Unicode (ISO/IEC 10646) code point of the character it stores.
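Checking for the macro at compile time is straightforward. A small sketch (the function name iso_10646_version is just for illustration):

```c
/* Returns the yyyymm value of __STDC_ISO_10646__ if the implementation
   defines it, or 0 if it makes no such promise about wchar_t values. */
long iso_10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

A nonzero result means wchar_t values can be treated as Unicode code points on that implementation; a zero result only means the implementation does not promise it.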
