
std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What is exactly a "wide character"?

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
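
The relationship can be checked at compile time. The following is a minimal sketch; the helper name byteSize is made up for this illustration and is not part of the standard library:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are the same class template, instantiated
// on two different character types.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Hypothetical helpers: memory taken by the stored characters
// (terminator not counted).
inline std::size_t byteSize(const std::string& s)
{
    return s.size() * sizeof(char);
}

inline std::size_t byteSize(const std::wstring& s)
{
    return s.size() * sizeof(wchar_t);
}
```

For the same three-letter text, the std::string version reports 3 bytes, while the std::wstring version reports 3 * sizeof(wchar_t), whatever that is on the platform at hand.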

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

What about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to Unicode.

On Linux?

Let's take a Linux OS: My Ubuntu system is already Unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:

#include <cstring>
#include <cwchar>
#include <iostream>

int main(int argc, char* argv[])
{
   const char text[] = "olé" ;


   std::cout << "sizeof(char)    : " << sizeof(char) << std::endl ;
   std::cout << "text            : " << text << std::endl ;
   std::cout << "sizeof(text)    : " << sizeof(text) << std::endl ;
   std::cout << "strlen(text)    : " << strlen(text) << std::endl ;

   std::cout << "text(ordinals)  :" ;

   for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned char>(text[i])
                          );
   }

   std::cout << std::endl << std::endl ;

   // - - - 

   const wchar_t wtext[] = L"olé" ;

   std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
   //std::cout << "wtext           : " << wtext << std::endl ; <- error
   std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << std::endl ;
   std::wcout << L"wtext           : " << wtext << std::endl;

   std::cout << "sizeof(wtext)   : " << sizeof(wtext) << std::endl ;
   std::cout << "wcslen(wtext)   : " << wcslen(wtext) << std::endl ;

   std::cout << "wtext(ordinals) :" ;

   for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned short>(wtext[i])
                              );
   }

   std::cout << std::endl << std::endl ;

   return 0;
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol�
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with Unicode chars because some combinations of chars are forbidden in UTF-8.
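
To illustrate the truncation caveat, here is a minimal sketch of cutting a UTF-8 std::string at a byte limit without splitting a multi-byte sequence. The function names are hypothetical, and the input is assumed to be valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, whose bit pattern is 10xxxxxx.
inline bool isContinuationByte(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

// Keeps at most maxBytes bytes of s, backing up so that the cut never
// falls in the middle of a multi-byte character.
inline std::string truncateUtf8(const std::string& s, std::size_t maxBytes)
{
    if (s.size() <= maxBytes)
        return s;

    std::size_t end = maxBytes;
    // If the first dropped byte is a continuation byte, the character it
    // belongs to started earlier: move the cut back to that start.
    while (end > 0 && isContinuationByte(static_cast<unsigned char>(s[end])))
        --end;

    return s.substr(0, end);
}
```

Asking for 3 bytes of "olé" (4 bytes in UTF-8) yields "ol" rather than "ol" followed by a dangling 0xC3 byte.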

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world, before the advent of Unicode.

So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode based applications, Windows uses wchar_t, which is 2 bytes wide, and is encoded in UTF-16, which is Unicode encoded on 2-byte characters (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low level API function to set the label on a Win32 GUI).

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less or the same amount of memory than a UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 one.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
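
The per-character costs above follow directly from the code point ranges of each encoding. A small sketch (the function names are made up for this illustration, and the argument is assumed to be a valid Unicode scalar value):

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
inline std::size_t utf8Bytes(char32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. accented Latin, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the BMP, e.g. most CJK
    return 4;                    // outside the BMP
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP, a surrogate
// pair (two units) outside it.
inline std::size_t utf16Bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

// UTF-32 always spends 4 bytes.
inline std::size_t utf32Bytes(char32_t)
{
    return 4;
}
```

For example, 'a' costs 1/2/4 bytes in UTF-8/UTF-16/UTF-32, 'é' (U+00E9) costs 2/2/4, '中' (U+4E2D) costs 3/2/4, and a code point outside the BMP such as U+1F600 costs 4/4/4.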

See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

Conclusion

  1. When should I use std::wstring over std::string?

    On Linux? Almost never (§).
    On Windows? Almost always (§).
    On cross-platform code? Depends on your toolkit...

    (§): unless you use a toolkit/framework saying otherwise

  2. Can std::string hold all the ASCII character set including special characters?

    Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!

    On Linux? Yes.
    On Windows? Only special characters available for the current locale of the Windows user.

    Edit (after a comment from Johann Gerell):
    a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. a char from 0 to 127 will be held correctly
    3. a char from 128 to 255 will have a meaning depending on your encoding (Unicode, non-Unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.

  3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, with the exception of GCC based compilers that are ported to Windows.
    It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

  4. What is exactly a wide character?

    In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).

I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly a UI application, the suggestion is to store Unicode strings in std::string, encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface with APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka wide characters). A character in UTF-16 takes either one or two pairs.

So, every reader here should now have a clear understanding of the facts and the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].

My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.

My solution, after in-depth investigation, much frustration and the consequential experiences, is the following:

  1. accept that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)

  2. use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)

  3. accept that such a UTF8String object is just a dumb but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those.)

  4. use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) - this is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).

  5. use UCS2String instances whenever character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte representation. It is simple, fast, easy.

  6. add two utility functions to convert back & forth between UTF-8 and UCS-2:

     UCS2String ConvertToUCS2( const UTF8String &str );
     UTF8String ConvertToUTF8( const UCS2String &str );

The conversions are straightforward, google should help here ...
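
One possible shape for these two functions is sketched below. This is an illustration rather than the answerer's actual code: it handles only the Basic Multilingual Plane (all that UCS-2 can represent), does not reject overlong encodings, and replaces anything it cannot decode or encode with U+FFFD.

```cpp
#include <cstddef>
#include <string>

typedef std::string  UTF8String;  // bytes holding UTF-8
typedef std::wstring UCS2String;  // one element per BMP code point

// Decodes UTF-8 into UCS-2. Four-byte sequences and malformed or
// truncated input produce U+FFFD, one per offending byte.
UCS2String ConvertToUCS2(const UTF8String& str)
{
    UCS2String out;
    std::size_t i = 0;
    while (i < str.size()) {
        const unsigned char b0 = static_cast<unsigned char>(str[i]);
        unsigned long cp = 0xFFFD;  // replacement character by default
        std::size_t len = 1;
        if (b0 < 0x80) {
            cp = b0;  // single byte: ASCII
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < str.size()) {
            const unsigned char b1 = static_cast<unsigned char>(str[i + 1]);
            if ((b1 & 0xC0) == 0x80) {
                cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
                len = 2;
            }
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < str.size()) {
            const unsigned char b1 = static_cast<unsigned char>(str[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(str[i + 2]);
            if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
                cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
                len = 3;
            }
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Encodes UCS-2 as UTF-8. Values above U+FFFF and surrogate values are
// not valid UCS-2 and are written as U+FFFD.
UTF8String ConvertToUTF8(const UCS2String& str)
{
    UTF8String out;
    for (std::size_t i = 0; i < str.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(str[i]);
        if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

With these, the 4-byte UTF-8 string "olé" becomes a 3-element UCS2String whose last element is 0xE9, and converting back gives the original bytes.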

That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.

Alternatives & Improvements

  • conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with the help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.

  • if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)
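
As a sketch of the translation-table idea: for ISO-8859-1 specifically the table is the identity (byte value N is code point U+00NN), so the lookup can be written as a plain cast. Other single-byte encodings would need a real 256-entry table. The function names below are hypothetical.

```cpp
#include <cstddef>
#include <string>

// ISO-8859-1 bytes to wide characters. Going through unsigned char
// keeps bytes >= 0x80 from being sign-extended.
std::wstring Latin1ToWide(const std::string& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(in[i])));
    return out;
}

// Wide characters back to ISO-8859-1. Code points above U+00FF have no
// Latin-1 equivalent and are replaced by '?'.
std::string WideToLatin1(const std::wstring& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(in[i]);
        out.push_back(cp <= 0xFF ? static_cast<char>(cp) : '?');
    }
    return out;
}
```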

ICU or other Unicode libraries?

For advanced stuff.

  1. When you want to have wide characters stored in your string. How wide depends on the implementation. Visual C++ defaults to 16 bits if I remember correctly, while the GCC default depends on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with Unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store Unicode strings fine in std::string using the UTF-8 encoding too. But it won't understand the meaning of Unicode code points. So str.size() won't give you the number of logical characters in your string, but merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle UTF-8.

    If your wchar_t is 32 bits long, then you can use UTF-32 as a Unicode encoding, and you can store and handle Unicode strings using a fixed (UTF-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right number of both wchar_t elements and logical characters.

  2. Yes, char is always at least 8 bits long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.
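
A minimal sketch of the size() point from item 1, using std::u32string as a stand-in for a 32-bit wstring so that it behaves the same on every platform (that substitution and the function names are assumptions of this example):

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
inline std::size_t utf8CodePoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}

// In UTF-32 every element is one code point, so size() is already the
// code point count.
inline std::size_t utf32CodePoints(const std::u32string& s)
{
    return s.size();
}
```

For "olé", a UTF-8 std::string has size() 4 but 3 code points, while the UTF-32 string has size() 3. Note that a code point is still not always what a user perceives as one character (combining marks, for instance), which is where libraries such as ICU come in.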

I frequently use std::string to hold UTF-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use UTF-8 as the native string type as well.

For example, I use UTF-8 when interfacing my code with the Tcl interpreter.

The major caveat is that the length of the std::string is no longer the number of characters in the string.

A good question! I think DATA ENCODING (sometimes a CHARSET is also involved) is a MEMORY EXPRESSION MECHANISM used to save data to a file or transfer data via a network, so I answer this question as follows:

1. When should I use std::wstring over std::string?

If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a 2-byte network stream, we should declare a std::wstring variable to easily process it. E.g.: wstring ws=L"中国a" (6 octets of memory with a 2-byte wchar_t: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get character '中', ws[1] to get character '国', ws[2] to get character 'a', etc.
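
The indexing example can be written out as follows. The literal uses \u escapes so that the result does not depend on the source file encoding; sampleText is a name made up for this sketch.

```cpp
#include <string>

// The text "中国a": U+4E2D, U+56FD, then ASCII 'a'.
inline std::wstring sampleText()
{
    return L"\u4E2D\u56FD" L"a";
}
```

Here sampleText()[0] is 0x4E2D, [1] is 0x56FD and [2] is L'a'. The memory used is size() * sizeof(wchar_t): 6 octets where wchar_t is 2 bytes (Windows), 12 where it is 4 bytes (typical Linux).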

2. Can std::string hold the entire ASCII character set, including the special characters?

Yes. But notice: in American ASCII each octet stands for one character (strictly, only 0x00~0x7F are ASCII; the meaning of 0x80~0xFF depends on the code page). This includes printable text such as "123abc&*_&" and the special characters you mention, which are mostly printed as a '.' to avoid confusing editors or terminals. Some other countries extend their own "ASCII" charset, e.g. Chinese, using 2 octets to stand for one character.

3. Is std::wstring supported by all popular C++ compilers?

Maybe, or mostly. I have used VC++6 and GCC 3.3: YES.

4. What is exactly a "wide character"?

A wide character mostly indicates using 2 octets or 4 octets to hold all countries' characters. 2-octet UCS2 is a representative sample: for example, English 'a' takes 2 octets in memory, 0x0061 (vs. ASCII, where 'a' takes 1 octet, 0x61).

  1. When you want to store 'wide' (Unicode) characters.
  2. Yes: 255 of them (excluding 0).
  3. Yes.
  4. Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html

Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.

The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.

Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding is that the indices and lengths are measured in bytes, not characters.

The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char; it is usually 16 bits or 32 bits. wstring can be used for processing text in the implementation-defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.

If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
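
To make the UTF-16 caveat concrete, here is a minimal sketch that counts code points in UTF-16 text. It uses std::u16string rather than std::wstring so that the element size is 16 bits on every platform; that choice and the function name are assumptions of this example.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string. A high surrogate
// (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF) forms
// one code point; every other unit counts on its own.
inline std::size_t utf16CodePoints(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool nextLow = i + 1 < s.size()
                          && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && nextLow)
            ++i;  // skip the second half of the pair
        ++n;
    }
    return n;
}
```

For the text 'a' followed by U+1F600, length() reports 3 units while the function above reports 2 code points.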

  1. when you want to use Unicode strings and not just ASCII; helpful for internationalisation
  2. yes, but it doesn't play well with 0
  3. not aware of any that don't
  4. wide character is the compiler specific way of handling the fixed length representation of a Unicode character; for MSVC it is a 2-byte character, for gcc I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html

There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using <<, and passing a filename to std::fstream.

I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which one Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter, and I'm happy to change it if I have things wrong.

String literals

If you enter string literals that contain only characters that can be represented by your codepage, then VS stores them in your file with a 1-byte-per-character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.

If you enter any string literals that cannot be represented by your codepage, then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text, and any characters missing from the code page are replaced with ?.

The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L, making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor, or you need to convert it to UTF-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well not have used a wide string literal.

std::cout

When outputting to the console using << you can only use std::string, not std::wstring, and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions, and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).

std::fstream filenames

Windows OS uses UCS2/UTF-16 for its filenames, so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so it probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.

Your options

If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.

If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle?

If you are cross platform then it's a mess, to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.

Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code

#include <fstream>
#include <string>
#include <iostream>

#ifdef _WIN32
#include <Windows.h>   // Windows-only header, so keep it inside the #ifdef
typedef std::wstring unicodestring;
#define UNI(text) L ## text
std::string formatForConsole(const unicodestring &str)
{
    std::string result;
    //Call WideCharToMultiByte to do the conversion
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text
std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif

int main()
{

    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}

would be fine on either platform I think.

Answers

So to answer your questions:

1) If you are programming for Windows, then all the time; if cross platform then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform specific #ifdefs to work around the differences; if just using Linux then never.

2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.

3) I believe so, but if not then it is just a simple typedef of a 'std::basic_string' using wchar_t instead of char.

4) A wide character is a character type which is bigger than the 1-byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.

1) As mentioned by Greg, wstring is helpful for internationalization, that is, when you will be releasing your product in languages other than English.

4) Check this out for wide character: http://en.wikipedia.org/wiki/Wide_character

When should you NOT use wide characters?

When you're writing code before the year 1990.

Obviously, I'm being flip, but really, it's the 21st century now. 127 characters have long since ceased to be sufficient. Yes, you can use UTF8, but why bother with the headaches?
