How to Convert UTF-16 Surrogate Decimal to UNICODE in C++

I got some string data from a parameter, such as the decimal values 55357 and 56842.

These are Unicode's UTF-16 surrogate pairs represented as decimal.

How can I convert them to Unicode code points such as "U+1F62C" with the standard library?

You can easily do it by hand. The algorithm for going from a code point above U+FFFF to its surrogate pair and back is not that hard. The Wikipedia page on UTF-16 says:

U+10000 to U+10FFFF

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
  • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
  • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

That's just addition, subtraction, and bitwise AND, OR and shift operations, so it can trivially be implemented in C or C++.


As you said you wanted to use the standard library, what you ask for is a conversion from two 16-bit UTF-16 surrogates to one 32-bit Unicode code point, so codecvt is your friend, provided you can compile in C++11 mode or higher (note that the <codecvt> header has been deprecated since C++17).

Here is an example processing your values on a little-endian architecture:

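#include <cstdint>
#include <cwchar>
#include <iostream>
#include <locale>
#include <codecvt>  // deprecated since C++17

int main() {
    // Converts UTF-16LE bytes to 32-bit code points
    std::codecvt_utf16<char32_t, 0x10ffffUL,
        std::codecvt_mode::little_endian> cvt;
    std::mbstate_t state{};  // conversion state must start zero-initialized

    char16_t pair[] = { 55357, 56842 };  // 0xD83D, 0xDE0A
    // View the pair as raw bytes; this matches little_endian only on a
    // little-endian host
    const char *first = reinterpret_cast<const char *>(pair);
    const char *next;

    char32_t u[2];
    char32_t *unext;

    cvt.in(state, first, first + sizeof pair, next, u, u + 1, unext);

    std::cout << std::hex << static_cast<std::uint16_t>(pair[0]) << " "
        << static_cast<std::uint16_t>(pair[1]) << std::endl;
    std::cout << std::hex << static_cast<std::uint32_t>(u[0]) << std::endl;

    return 0;
}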

The output is as expected:

d83d de0a
1f60a
