
C++ UTF-8 lightweight & permissive code?

Original post 2010-06-08 10:56:31 7 3 c++/ utf-8/ glib

Does anyone know of a more permissively licensed (MIT / public domain) version of this:

http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html

(a 'drop-in' replacement for std::string that's UTF-8 aware)

It is lightweight and does everything I need and even more (I doubt I'll even use the UTF-XX conversions).

I really don't want to be carrying ICU around with me.

3 answers

std::string is fine for UTF-8 storage.
If you need to analyze the text itself, UTF-8 awareness will not help you much, as too many things in Unicode do not work on a code point basis.

Take a look at the Boost.Locale library (it uses ICU under the hood):

Reference http://cppcms.sourceforge.net/boost_locale/html/
Tutorial http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
Download https://sourceforge.net/projects/cppcms/files/

It is not lightweight, but it lets you handle Unicode correctly, and it uses std::string as storage.

If you expect to find a lightweight Unicode-aware library for dealing with strings, you will not find one, because Unicode is not lightweight. Even relatively "simple" things like upper-case and lower-case conversion or Unicode normalization require complex algorithms and access to the Unicode database.

If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/

Answer to comment:

1) Find file formats for files included by me

std::string::find is perfectly fine for this.
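To illustrate why a plain byte search is enough here: in valid UTF-8, a match for a valid UTF-8 needle always starts on a code point boundary, and ASCII bytes such as '.' never occur inside a multi-byte sequence. A minimal sketch, assuming both strings are valid UTF-8 in the same normalization form; `contains_utf8` and `file_extension` are hypothetical helper names, not part of any library:

```cpp
#include <string>

// Returns true if `haystack` contains `needle`, comparing raw bytes.
// Assumption: both strings hold valid UTF-8 in the same normalization form.
bool contains_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Returns the text after the last '.' in `name`, or an empty string if
// there is no '.'. The '.' is a single ASCII byte (0x2E), which never
// appears inside a multi-byte UTF-8 sequence, so rfind cannot split a
// code point. Directory separators are deliberately not handled here.
std::string file_extension(const std::string& name) {
    const std::string::size_type dot = name.rfind('.');
    return dot == std::string::npos ? std::string() : name.substr(dot + 1);
}
```

Note that this only matches byte-identical text: a precomposed "ü" will not match "u" followed by a combining diaeresis unless both sides are normalized first.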

2) Line break detection

This is not a simple issue. Have you ever tried to find a line break in Chinese or Japanese text? Probably not, as spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only pango has something like that.)

And of course Boost.Locale does this, and does it correctly.

And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.

3) Character (or now, code point) counting. Looking at utfcpp, thx

Characters are not code points. For example, the Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 Unicode code points, where two code points are used for vowels. The same holds for European languages, where a single character can be represented by two code points; for example, "ü" can be represented as "u" and "¨", two code points.
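The difference between bytes, code points and characters can be checked with a few lines of code. This is a rough sketch, not how utfcpp does it: it assumes the input is already valid UTF-8, and `count_code_points` is a hypothetical name. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), since each such byte starts exactly one code point:

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold valid UTF-8.
// Continuation bytes look like 10xxxxxx; every other byte begins a
// new code point, so counting those gives the code point count.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

With this, the precomposed "ü" (bytes C3 BC) counts as one code point, while "u" followed by the combining diaeresis U+0308 (bytes 75 CC 88) counts as two, even though both display as a single character. Counting user-perceived characters needs grapheme cluster rules, which is where a library such as Boost.Locale comes in.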

So if you are aware of these issues, utfcpp will be fine; otherwise you will not find anything simpler.

You might be interested in Björn Höhrmann's "Flexible and Economical UTF-8 Decoder", but it is by no means a std::string replacement.

I have never used it, but I stumbled upon this UTF-8 CPP library a while ago and liked it enough to bookmark it. It is released under a BSD-like license, IIUC.

It still relies on std::string for strings and provides lots of utility functions to help check that a string is really UTF-8, count the number of characters, go back or forward by one character … It is really small and lives only in header files: looks really good!
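For a feel of what "go back or forward by one character" means on top of std::string, here is a small hand-written sketch. It is not the UTF-8 CPP API; `is_continuation`, `next_code_point` and `prev_code_point` are hypothetical names, the code assumes valid UTF-8, and it moves by code point rather than by user-perceived character:

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Returns the byte index of the code point following the one that
// starts at `pos`, or s.size() if there is none.
std::size_t next_code_point(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    ++pos;
    while (pos < s.size() &&
           is_continuation(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
    return pos;
}

// Returns the byte index of the code point preceding `pos`,
// or 0 if `pos` is already at the start.
std::size_t prev_code_point(const std::string& s, std::size_t pos) {
    if (pos > s.size()) pos = s.size();
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 &&
           is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

Both functions work on byte indices into an ordinary std::string, so they can be mixed freely with find, substr and the rest of the standard interface.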