
Unicode Processing in C++

What are the best practices for Unicode processing in C++?

  • Use ICU (or a similar library) for dealing with your data.
  • In your own data store, make sure everything is stored in the same encoding.
  • Make sure you always use your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library functions like isalpha unless that is the definition you want.
  • I can't say it enough: never iterate over the indices of a string if you care about correctness; always use your Unicode library for this.
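To see why indexing is unsafe, here is a minimal sketch (my own illustration, not from any library) that counts code points in a UTF-8 std::string. The helper name count_code_points is hypothetical, and it assumes the input is already valid UTF-8. Note that it counts code points, not user-perceived characters; for grapheme clusters you still want something like ICU.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8 input;
// this hypothetical helper performs no validation.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // lead byte or ASCII, not a continuation
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "naïve", size() reports 6 bytes while the helper reports 5 code points, so index 2 or 3 of the string lands in the middle of a character.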

If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built-in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf

So the truly best practice for Unicode processing in C++ would be to use the built-in facilities for it. That isn't always possible with older code bases, though, since the standard is so new at present.

EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now, you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
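As a rough illustration of what "Unicode literals and Unicode strings" means in C++11, the sketch below uses the u"" and U"" prefixes with std::u16string and std::u32string. The variable names are made up for the example. C++11 also has u8"" literals for UTF-8, but their element type changed from char to char8_t in C++20, so the sketch leaves them out.

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
// as a surrogate pair (two code units) while UTF-32 uses one code unit.
const std::u16string smile_utf16 = u"\U0001F600";
const std::u32string smile_utf32 = U"\U0001F600";

// U+00E9 (e with acute accent) fits in a single code unit in both forms.
const std::u16string e_acute_utf16 = u"\u00E9";
const std::u32string e_acute_utf32 = U"\u00E9";
```

The strings store code units only. Anything beyond that (case mapping, normalization, collation) is where the limited library support mentioned above shows up.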

Our company (and others) use the open source International Components for Unicode (ICU) library, originally developed by Taligent.

It handles strings, locales, conversions, dates/times, collation, transformations, etc.

Start with the ICU User Guide.

Here is a checklist for Windows programming:

  • Enclose all strings in _T("my string")
  • Replace strlen() and similar functions with _tcslen() etc.
  • Use LPTSTR and LPCTSTR instead of char * and const char *
  • When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
  • For C++ strings, use std::wstring instead of std::string
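The last item can be sketched portably with plain wide strings. In a Unicode build on Windows, _T("...") expands to L"..." and _tcslen maps to wcslen, so the code below is roughly what the macros boil down to. The names kGreeting, wide_length and make_wide_copy are just for illustration. Keep in mind that wchar_t is 16 bits (UTF-16) on Windows but typically 32 bits elsewhere.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Wide literal: each element is a wchar_t rather than a char.
const wchar_t* const kGreeting = L"h\u00E9llo";

// Wide counterpart of strlen(): counts wchar_t units, not bytes.
std::size_t wide_length(const wchar_t* text) {
    return std::wcslen(text);
}

// std::wstring owns a copy of the wide characters.
std::wstring make_wide_copy(const wchar_t* text) {
    return std::wstring(text);
}
```

Both helpers report 5 for kGreeting on either platform, because every character here fits in a single wchar_t; characters outside the BMP would take two units on Windows.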

Look at the question "Case insensitive string comparison in C++".

That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx

If you look at the left-hand navigation pane on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" ( http://msdn.microsoft.com/en-us/library/cc194786.aspx ).

It has the following subsections:

  • The Code-Page Model
  • Double-Byte Character Sets in Windows
  • Unicode
  • Compatibility Issues in Mixed Environments
  • Unicode Data Conversion
  • Migrating Windows-Based Programs to Unicode
  • Summary

Although this may not be best practice for everyone, you can write your own C++ Unicode routines if you want!

I just finished doing it over a weekend. I learned a lot. I don't guarantee it's 100% bug free, but I did a lot of testing and it seems to work correctly.

My code is under the New BSD license and can be found here:

http://code.google.com/p/netwidecc/downloads/list

It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and standard ASCII. If you throw away the main code, you've got a nice library for reading and writing Unicode.
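To give a feel for what such routines involve, here is a minimal sketch of encoding one code point as UTF-8 and as UTF-16. This is not WSUCONV's code; the function names append_utf8 and append_utf16 are hypothetical, and both assume the caller passes a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of one code point (1 to 4 bytes).
// Assumption: cp is a valid scalar value; nothing is validated here.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Appends the UTF-16 encoding of one code point (1 or 2 code units).
// Code points above U+FFFF become a high/low surrogate pair.
void append_utf16(std::u16string& out, std::uint32_t cp) {
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    }
}
```

A real converter also has to reject surrogates, out-of-range values and malformed input, which is where most of the effort goes.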

As has been said above, a library is the best bet when working on a large system. However, sometimes you do want to handle things yourself (maybe because the library would use too many resources, as on a microcontroller). In this case you want a simple library that you can copy parts out of for the things you actually need.

Willow Schlanger's example code seems like a good one (see his answer for more details).

I also found another one that has smaller code. It lacks full error checking and only handles UTF-8, but it was simpler to take parts out of.
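For a sense of scale, a stripped-down UTF-8 decoder of the kind you might lift into a small project can look like the sketch below. It is my own illustration, not code from either library mentioned here, and the name decode_next is hypothetical. It assumes pos is a valid index into the string, and like the smaller library it skips full error checking: overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the code point that starts at byte offset `pos` and advances
// `pos` past the bytes it consumed. Returns U+FFFD for an invalid lead
// byte or a truncated/broken sequence. Precondition: pos < utf8.size().
std::uint32_t decode_next(const std::string& utf8, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(utf8[pos++]);
    std::uint32_t cp = 0;
    int extra = 0;  // number of continuation bytes expected
    if (lead < 0x80) {
        cp = lead;
    } else if ((lead >> 5) == 0x06) {  // 110xxxxx
        cp = lead & 0x1F;
        extra = 1;
    } else if ((lead >> 4) == 0x0E) {  // 1110xxxx
        cp = lead & 0x0F;
        extra = 2;
    } else if ((lead >> 3) == 0x1E) {  // 11110xxx
        cp = lead & 0x07;
        extra = 3;
    } else {
        return 0xFFFD;  // stray continuation byte or invalid lead byte
    }
    for (; extra > 0; --extra) {
        if (pos >= utf8.size()) return 0xFFFD;  // sequence cut short
        const unsigned char next = static_cast<unsigned char>(utf8[pos]);
        if ((next & 0xC0) != 0x80) return 0xFFFD;  // not 10xxxxxx
        cp = (cp << 6) | (next & 0x3F);
        ++pos;
    }
    return cp;
}
```

Calling it in a loop while pos < utf8.size() walks the string one code point at a time, which is usually all a microcontroller-sized project needs.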

Here's a list of the embedded libraries that seem decent.

Embedded libraries

Take a look at the recommendations of UTF-8 Everywhere.

Use IBM's International Components for Unicode.


 