Strings and character encoding in C++

I have read a few posts about best practices for strings and character encoding in C++, but I am struggling to find a general-purpose approach that seems reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:

typedef std::string string8;
typedef std::basic_string<uint32_t> string32;

The string8 type would be used for UTF-8; having a separate name is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string that removes the methods that aren't quite right for UTF-8.

The string32 type would be used for UTF-32 when a fixed character size is desired.

The UTF-8 CPP functions utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
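To make concrete what such wrapper functions would have to do, here is a minimal hand-rolled sketch. It is not the UTF-8 CPP implementation: the names to_utf32 and to_utf8 are hypothetical, it uses std::u32string (an assumption, in place of the string32 typedef above), and it skips checks for overlong forms and surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of UTF-8 CPP or the standard library).
// Decodes UTF-8 into UTF-32. Rejects bad lead bytes, bad continuation bytes
// and truncated input, but does NOT reject overlong forms or surrogates.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: encodes UTF-32 code points as UTF-8.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp <= 0x10FFFF) {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            throw std::invalid_argument("code point out of Unicode range");
        }
    }
    return out;
}
```

In real code a tested library is the safer choice; the sketch only shows that the conversion itself is a mechanical bit-shuffling step.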

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.

The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

Furthermore, encoding is easy (it's a simple mapping from code points to bytes and back), while the main difficulty is actually manipulating the data.

With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
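A small sketch of the first hazard, assuming valid UTF-8 input. The helper names is_continuation and truncate_utf8 are hypothetical, introduced only for illustration; the helper backs up to a code point boundary where a plain substr() would cut through a multi-byte sequence.

```cpp
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(char b) {
    return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// Hypothetical helper: truncate to at most max_bytes bytes without splitting
// a multi-byte sequence. Assumes s holds valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) return s;
    std::size_t cut = max_bytes;
    // If the cut would land inside a sequence, back up to its lead byte.
    while (cut > 0 && is_continuation(s[cut])) --cut;
    return s.substr(0, cut);
}
```

For the three-byte string "a\xC3\xA9" (the letter a followed by é), substr(0, 2) leaves a dangling lead byte, while truncate_utf8(s, 2) returns just "a". The helper does nothing for the second hazard: the UTF-32 string U"e\u0301" (e plus a combining acute accent) has size() 2, so cutting after the first element still drops the accent.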

Unicode support in C++ is therefore extremely subpar, as far as the standard is concerned.

If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

The traits approach described here might be helpful. It's an old but useful technique.

The standard does not specify which character encoding must be used for string, wstring, etc. The common way is to use Unicode in wide strings. Which types and encodings you should use depends on your requirements.

If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with strings (extract, concat, sort, ...), choose std::wstring with UCS2/UTF-16 (BMP only) as the encoding on Windows and UCS4/UTF-32 on Linux. The benefit is the fixed size: each character occupies 2 bytes (or 4 for UCS4), whereas length() on a std::string holding UTF-8 returns the number of bytes, not the number of characters.
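The length() mismatch can be seen with a short sketch. The function name count_code_points is hypothetical, and the input is assumed to be valid UTF-8; it counts every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a valid UTF-8 string.
// Each code point has exactly one byte that is not of the form 10xxxxxx.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "naïve" (where ï is encoded as 0xC3 0xAF), length() reports 6 while count_code_points() reports 5. Note that this counts code points, which is still not the same as user-perceived characters when combining marks are involved.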

For conversion, you can check whether sizeof(std::wstring::value_type) is 2 or 4 to choose UCS2 or UCS4. I'm using the ICU library, but there may be simpler wrapper libraries.
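One way to express that check at compile time is sketched below. The enum WideEncoding and the function wide_encoding are hypothetical names used only for illustration; the sketch assumes, as the answer does, that a 2-byte wchar_t means UCS2/UTF-16 and a 4-byte one means UCS4/UTF-32.

```cpp
#include <string>

// Hypothetical tag for the encoding implied by the size of wchar_t.
enum class WideEncoding { Utf16, Utf32 };

// Selects the encoding from sizeof(std::wstring::value_type), following the
// convention described above (2 bytes on Windows, 4 bytes on Linux).
constexpr WideEncoding wide_encoding() {
    static_assert(sizeof(std::wstring::value_type) == 2 ||
                  sizeof(std::wstring::value_type) == 4,
                  "unexpected wchar_t size");
    return sizeof(std::wstring::value_type) == 2 ? WideEncoding::Utf16
                                                 : WideEncoding::Utf32;
}
```

Conversion code can then branch on wide_encoding() once, instead of scattering sizeof checks through the program.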

Deriving from std::string is not recommended because basic_string is not designed for it (it lacks virtual members, etc.). If you really, really need your own type, like std::basic_string<my_char_type>, write a custom specialization for it.

The new C++0x standard defines wstring_convert<> and wbuffer_convert<> to convert, using a std::codecvt facet, from a narrow character set to a wide one (for example UTF-8 to UCS2). Visual Studio 2010 has already implemented this, AFAIK.
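A minimal sketch of wstring_convert in use, assuming the std::codecvt_utf8 facet from <codecvt>. The wrapper names widen_utf8 and narrow_utf8 are hypothetical. Be aware that these facilities were later deprecated in C++17, so newer compilers may emit deprecation warnings for this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: UTF-8 bytes -> wide string (UCS2 or UCS4, depending
// on the size of wchar_t on the platform).
std::wstring widen_utf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical wrapper: wide string -> UTF-8 bytes.
std::string narrow_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}
```

Both conversions throw std::range_error on input they cannot convert, so callers handling untrusted data should be prepared to catch it.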
