
Working with UTF-8 vs UTF-16 vs UTF-32 internally within C++?

Original post: 2014-09-07 17:48:47 8 1 c++/ unicode/ utf-8

I only have experience processing ASCII (single-byte characters), and I have read a number of posts on the different ways people process Unicode, each of which presents its own set of issues.

At this point in my very limited exposure to Unicode, I've read that internal processing with UTF-16 presents portability and other issues.

I feel that UTF-32 makes more sense than UTF-16, since every Unicode code point fits within 4 bytes, but it would consume more resources, especially if you are mainly dealing with ISO-8859-1 characters.

I humbly feel that UTF-8 could be an ideal format to work with internally (especially when you deal mainly with English and other Latin-based text), since characters in the ASCII range would be handled byte by byte very efficiently. Accented Latin characters (such as those in the upper half of ISO-8859-1) would consume two bytes, and other characters would of course consume three or four.
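
To make those byte counts concrete, here is a minimal sketch of how many bytes UTF-8 needs for a single code point. The function name `utf8_encoded_length` is made up for this illustration and does not come from any library.

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs to encode one Unicode code point.
// Returns 0 for values that cannot be encoded: the surrogate range
// U+D800..U+DFFF and anything above U+10FFFF.
// (Hypothetical helper, written only for this example.)
inline std::size_t utf8_encoded_length(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogates are not scalar values
    if (cp <= 0x7F)     return 1;  // ASCII
    if (cp <= 0x7FF)    return 2;  // e.g. accented Latin letters, Greek, Cyrillic
    if (cp <= 0xFFFF)   return 3;  // rest of the Basic Multilingual Plane
    if (cp <= 0x10FFFF) return 4;  // supplementary planes (emoji and so on)
    return 0;
}
```

With this sketch, 'A' (U+0041) needs 1 byte, 'é' (U+00E9) needs 2, '€' (U+20AC) needs 3, and U+1F600 needs 4, compared with a flat 4 bytes each in UTF-32.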

Another advantage that I see is that UTF-8 strings can be stored in a regular C++ std::string or C string array, which seems so natural.
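
One thing worth keeping in mind with that approach: std::string knows nothing about the encoding, so size() reports bytes rather than characters. A rough sketch of counting code points instead, assuming the input is already valid UTF-8 (the name `count_code_points` is hypothetical):

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes valid UTF-8; it does no validation of its own.
// (Hypothetical helper, written only for this example.)
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the string "h\xC3\xA9llo" ("héllo" spelled out as bytes), size() is 6 while count_code_points is 5. Note that code points are still not the same thing as user-perceived characters once combining marks are involved.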

The disadvantage of using UTF-8, for me at least, is that I have not found any libraries that support UTF-8 internally. For example, I have not found any libraries for UTF-8 case conversion or substring operations.

Another disadvantage for me is that I have not found functions that parse the bytes of a UTF-8 string into characters for processing.
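
For what it's worth, parsing the bytes by hand is not much code. Below is a minimal sketch of a decoder using only the standard library. The names `next_code_point` and `decode_utf8` are invented for this example, and throwing on malformed input is just one possible design (others substitute U+FFFD instead).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes the code point starting at byte offset `pos` and advances `pos`
// past it. Throws std::out_of_range if `pos` is past the end, and
// std::runtime_error on malformed input (bad lead or continuation byte,
// truncated or overlong sequence, surrogate, value above U+10FFFF).
// (Hypothetical helper, written only for this example.)
inline char32_t next_code_point(const std::string& utf8, std::size_t& pos) {
    if (pos >= utf8.size()) throw std::out_of_range("pos is past the end");
    const unsigned char lead = static_cast<unsigned char>(utf8[pos]);

    std::size_t length;  // total bytes in this sequence
    char32_t cp;         // payload bits collected so far
    char32_t min_value;  // smallest code point allowed at this length
    if (lead < 0x80)                { length = 1; cp = lead;        min_value = 0x0; }
    else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; min_value = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; min_value = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; min_value = 0x10000; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (utf8.size() - pos < length) throw std::runtime_error("truncated UTF-8 sequence");

    // Each continuation byte contributes its low 6 bits.
    for (std::size_t i = 1; i < length; ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[pos + i]);
        if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < min_value) throw std::runtime_error("overlong UTF-8 encoding");
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("code point out of range");

    pos += length;
    return cp;
}

// Decodes a whole UTF-8 string into UTF-32 code points.
inline std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    for (std::size_t pos = 0; pos < utf8.size(); )
        out.push_back(next_code_point(utf8, pos));
    return out;
}
```

Calling next_code_point in a loop lets you walk a std::string one character at a time while keeping the data itself in UTF-8.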

Would it be feasible to work with UTF-8 internally, and are there any support libraries available for this purpose? I do hope so, but if not, I think my best option would be to forget about using UTF-8 internally and use Boost.Locale, since I've read that ICU (which Boost.Locale can use as a backend) is a mature library used by many to handle Unicode.

I would really like to hear your opinions on this matter.

1 answer

I bumped into this very old question of mine, so I'll tell you what I ended up doing. I decided to stick with UTF-8 and store my data in std::string or single-byte char arrays. There was never a need for me to use wide character types!

The first library that I used was UTF8-CPP, which is very easy to bring into your app and use. But you soon find that you need more and more capability.

I really wanted to avoid using ICU because it is such a large library, but once you build it and get it installed, you begin to wish that you had done so in the first place, because it has everything you need and much, much more.

You may wonder what the benefits are:

I write truly portable code that builds under VC++ for Windows or GCC for Linux.
ICU has everything, and I mean everything, you need concerning Unicode.
I am able to stick with my beloved std::string and char arrays.
I use many open source libraries in my apps with zero issues. For example, I use RapidJSON to create in-memory JSON objects containing UTF-8 data. I'm able to pass them to a web server, write them to disk, etc. Really simple.
I store my data in Firebird SQL, but you need to specify the character set of your varchar and char fields as UTF8. This means that your strings will be stored as multi-byte in the database, but this is totally transparent to you, the developer. I am certain that this applies to other SQL databases as well.

Drawbacks:

It is a large library, and very scary and confusing at first.
The C++ was not written by C++ experts (like the Boost developers), so you may not like the syntax used. What I've done is to "wrap" common procedures with my own code, which pretty much means that I include my own UTF-8 library that wraps the uglier parts of ICU. Don't let this bother you, because ICU is totally stable and fast.
I personally link ICU dynamically into my applications. This means that I first built ICU as shared libraries for my 64-bit Windows and Linux environments. In the case of Windows, I store the DLLs in a folder somewhere and add that folder to my Windows path so that any app that requires ICU can find them.

When I looked at the built-in language features, I found several things lacking, such as lower/upper case conversion, word boundaries, counting characters, accent sensitivity, and string manipulation such as substrings. ICU's locale support is also totally amazing.
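
As a small illustration of that gap, here is roughly the most you can do safely on UTF-8 bytes without a Unicode library: upper-case the ASCII letters and leave everything else alone. This is only a sketch, and `ascii_to_upper` is a made-up name, not something from the standard library or ICU.

```cpp
#include <string>

// Upper-cases only the ASCII letters a-z in a UTF-8 string. Bytes that
// belong to multi-byte sequences fall outside 'a'..'z' and are copied
// through untouched, so the result is still valid UTF-8, but non-ASCII
// letters such as U+00E9 are NOT converted.
// (Hypothetical helper, written only for this example.)
inline std::string ascii_to_upper(std::string utf8) {
    for (char& ch : utf8) {
        if (ch >= 'a' && ch <= 'z') ch = static_cast<char>(ch - 'a' + 'A');
    }
    return utf8;
}
```

Given "h\xC3\xA9llo" ("héllo"), this returns "H\xC3\xA9LLO": the é stays lower case. Handling that letter, and locale-specific rules on top of it, is where a library like ICU earns its keep.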

I guess that summarizes my entire exercise with UTF-8.