
C++ unicode UTF-16 encoding

Original post 2010-04-21 02:42:29 6 2 c++/ unicode/ encoding/ utf-16

I have a wide-character string, L"hao123--我的上网主页", and it must be encoded as "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "%uNNNN" format for encoding Unicode UTF-16 code points. This website says these are JavaScript escapes, but I don't know how to produce them in C++.

Is there any library that can do this? Or can you give me some tips?

Thanks, my friends!

2 solutions

Embedding unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits or that the encoding will be UTF-16. While this may be the case on Windows with Microsoft Visual C++ (a particular C++ implementation), wchar_t is 32 bits on OS X's GCC (another implementation). If you have some sort of localized string constants, it's best to keep them in a configuration file in one particular encoding and to interpret them as having been encoded in that encoding. The International Components for Unicode (ICU) library provides pretty good support for interpreting and handling unicode. Another good library for converting between (but not interpreting) encoding formats is libiconv.
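If you are stuck with a wchar_t string anyway, one way to sidestep the size difference is to normalize it to UTF-16 first. The following is only a sketch: `wide_to_utf16` is a made-up name, and it assumes that a 16-bit wchar_t holds UTF-16 code units, that a 32-bit wchar_t holds UTF-32 code points, and that your compiler supports C++11.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of any library): converts a wide string to
// UTF-16. Assumes 16-bit wchar_t holds UTF-16 and 32-bit wchar_t holds UTF-32.
std::u16string wide_to_utf16(const std::wstring& wide) {
    std::u16string out;
    for (wchar_t wc : wide) {
        char32_t cp = static_cast<char32_t>(wc);
        if (cp <= 0xFFFF) {
            // Covers every element of a 16-bit wchar_t string, and every
            // Basic Multilingual Plane character of a 32-bit one.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Only reachable when wchar_t is wider than 16 bits:
            // split the code point into a surrogate pair.
            assert(cp <= 0x10FFFF);
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this, `wide_to_utf16(L"hao123--\u6211")` should give the same UTF-16 string on both kinds of implementation.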

Edit
It is possible I am misinterpreting your question. If the problem is that you already have a string in UTF-16 and you want to convert it to "unicode-escape ASCII" (i.e. an ASCII string where unicode characters are represented by "\u" followed by the numeric value of the character), then use the following pseudo-code:

for each codepoint represented by the UTF-16 encoded string:
    if the codepoint is in the range [0,0x7F]:
        emit the codepoint cast to a char
    else:
        emit "\u" followed by the hexadecimal digits representing the codepoint
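In C++ that could look roughly like the sketch below. `escape_utf16` is a made-up name, and the input is assumed to already be a C++11 `std::u16string` holding UTF-16. One deliberate difference from the pseudo-code: it escapes each 16-bit element directly instead of combining surrogate pairs first, on the assumption that the target is JavaScript-style escapes, where a character outside the Basic Multilingual Plane is written as two \uXXXX escapes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of any library): turns a UTF-16 string into
// ASCII, writing every non-ASCII 16-bit element as a \uXXXX escape.
std::string escape_utf16(const std::u16string& input) {
    std::string out;
    for (char16_t unit : input) {
        if (unit <= 0x7F) {
            // Plain ASCII is copied through unchanged.
            out.push_back(static_cast<char>(unit));
        } else {
            // Backslash, 'u', four uppercase hex digits, and a terminating NUL.
            char buf[7];
            int written = std::snprintf(buf, sizeof buf, "\\u%04X",
                                        static_cast<unsigned>(unit));
            assert(written == 6);
            (void)written;  // silences the unused warning when NDEBUG is set
            out += buf;
        }
    }
    return out;
}
```

Calling `escape_utf16(u"hao123--\u6211\u7684")` should produce the ASCII text `hao123--\u6211\u7684`.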

Now, to get the codepoint, there is a very simple rule: each element in the UTF-16 string is a codepoint, unless it is part of a "surrogate pair", in which case it and the element after it together make up a single codepoint. In that case, the Unicode standard defines a procedure for combining the "leading surrogate" and the "trailing surrogate" into a single code point. Note that UTF-8 and UTF-16 are both variable-length encodings; a fixed-width representation (UTF-32) uses 32 bits per code point. The Unicode Transformation Format (UTF) FAQ explains the encoding as well as how to identify surrogate pairs and how to combine them into codepoints.
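As a rough illustration of that rule, here is a sketch that walks a UTF-16 string and combines surrogate pairs into code points. `decode_utf16` is a made-up name, the input is assumed to be a C++11 `std::u16string`, and an unpaired surrogate is simply passed through unchanged rather than reported as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): decodes UTF-16 code units
// into Unicode code points. Unpaired surrogates are passed through as-is.
std::vector<char32_t> decode_utf16(const std::u16string& input) {
    std::vector<char32_t> code_points;
    for (std::size_t i = 0; i < input.size(); ++i) {
        char32_t unit = input[i];
        bool is_lead = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_lead && i + 1 < input.size()) {
            char32_t next = input[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Each surrogate carries 10 bits; the pair encodes the
                // code point minus 0x10000.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                assert(unit <= 0x10FFFF);
                ++i;  // the trailing surrogate has been consumed
            }
        }
        code_points.push_back(unit);
    }
    return code_points;
}
```

For example, the pair 0xD83D 0xDE00 combines to 0x10000 + (0x3D << 10) + 0x200 = 0x1F600.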