How do I get the UTF-8 int value of a Unicode char from a (w)string?

Question

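Situation

I need a function that takes a string and replaces every non-ASCII character with the hexadecimal representation of its UTF-8 encoding.

For example, the ӷ in a word like "djvӷdio" should be replaced with "d3b7", while the rest stays unchanged.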

Explanation:
ӷ (U+04F7) is encoded in UTF-8 as the two bytes 0xD3 0xB7; read as one integer that is 54199, or d3b7 in hexadecimal
djvӷdio --> djvd3b7dio

I already have a function that returns the hexadecimal value of an int.
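The helper itself is not shown in the question. A minimal sketch of what such a function might look like is given here; the name to_hex and its signature are assumptions made for illustration, not the asker's actual code.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the asker's existing helper; the real
// signature is not shown in the question.
std::string to_hex(unsigned int value)
{
    std::ostringstream out;
    out << std::hex << value;   // lower-case digits, no "0x" prefix
    return out.str();
}
```

With this sketch, to_hex(54199) yields "d3b7", which matches the explanation above.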

My machine

Kubuntu 19.10
编译器：g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008

My ideas

1. Idea

std::string encode_utf8(const std::string &str);

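Using the function above, I iterate over the whole string, which contains Unicode, and whenever the current character is non-ASCII I replace it with its hexadecimal value.

Problem:

Iterating over a string that contains Unicode is not a good idea, because a Unicode character takes up to 4 bytes, unlike a plain char. A single Unicode character is therefore seen as several chars, and the output is garbage. Simply put, the string cannot be indexed by character.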
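For this particular task, indexing by character may not be needed at all. In UTF-8 every byte of a multi-byte sequence has its high bit set, and ASCII bytes never occur inside such a sequence, so replacing each non-ASCII byte with two hex digits already gives the requested output. The sketch below assumes the std::string really holds UTF-8; the function name hex_escape_non_ascii is made up for illustration.

```cpp
#include <string>

// Replace every non-ASCII byte of a UTF-8 string with two lower-case
// hex digits. ASCII bytes (0x00-0x7F) are copied unchanged.
// Assumption: the input is UTF-8 encoded; no validation is performed.
std::string hex_escape_non_ascii(const std::string &utf8)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char byte : utf8)
    {
        if (byte < 0x80)
        {
            out += static_cast<char>(byte);
        }
        else
        {
            out += digits[byte >> 4];     // high nibble
            out += digits[byte & 0x0F];   // low nibble
        }
    }
    return out;
}
```

Called with "djv\xD3\xB7" "dio" (the literal is split so that the d is not read as part of the hex escape), this returns "djvd3b7dio".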

2. Idea

std::string encode_utf8(const std::wstring &wstr);

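Again I iterate over the whole string of Unicode characters, and whenever the current character is non-ASCII I replace it with its hexadecimal value.

Problem:

Indexing works now, but it returns a wchar_t holding the corresponding UTF-32 number, whereas I absolutely need the UTF-8 number.

How can I get a character from the string in a way that gives me its UTF-8 decimal value?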
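One way to read "the UTF-8 number" is: the bytes of the UTF-8 encoding packed into a single integer, first byte most significant. A minimal sketch of that conversion from a UTF-32 code point follows. The name utf8_as_int is hypothetical, and the caller is assumed to pass a valid Unicode scalar value (at most 0x10FFFF, not a surrogate).

```cpp
#include <cstdint>

// Pack the UTF-8 encoding of one code point into an integer,
// first byte in the most significant position.
// Assumption: cp is a valid Unicode scalar value; nothing is validated.
std::uint32_t utf8_as_int(char32_t cp)
{
    if (cp <= 0x7F)                       // 0xxxxxxx
        return cp;
    if (cp <= 0x7FF)                      // 110xxxxx 10xxxxxx
        return 0xC080u | ((cp >> 6) << 8) | (cp & 0x3F);
    if (cp <= 0xFFFF)                     // 1110xxxx 10xxxxxx 10xxxxxx
        return 0xE08080u | ((cp >> 12) << 16)
                         | (((cp >> 6) & 0x3F) << 8)
                         | (cp & 0x3F);
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return 0xF0808080u | (((cp >> 18) & 0x07) << 24)
                       | (((cp >> 12) & 0x3F) << 16)
                       | (((cp >> 6) & 0x3F) << 8)
                       | (cp & 0x3F);
}
```

For ӷ (U+04F7) this gives 0xD3B7, i.e. 54199. The result is unsigned because the 4-byte pattern does not fit in a signed 32-bit int.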

Answer 1

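Your input string is UTF-8 encoded, which means each character is encoded in 1 to 4 bytes. You can't just scan the string and convert the chars one by one unless your loop understands how Unicode characters are encoded in UTF-8.

You need a UTF-8 decoder.

Fortunately, if decoding is all you need, there are really lightweight ones. UTF8-CPP is pretty much a single header and has a function that hands you individual Unicode characters: utf8::next gives you a uint32_t (the code point of even the "largest" character fits in an object of this type). Now you can simply check whether the value is less than 128: if it is, cast it to char and append it; if it is not, serialise the integer in whatever way you see fit.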
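To illustrate what such a decoder does, here is a hand-rolled sketch that needs no library. It is not UTF8-CPP and does not reproduce the utf8::next interface; the names next_code_point and code_points are made up, and the input is assumed to be valid UTF-8 (a real decoder would also reject malformed sequences).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode the code point that starts at s[pos] and advance pos past it.
// Assumption: s is valid UTF-8; malformed input is not detected.
std::uint32_t next_code_point(const std::string &s, std::size_t &pos)
{
    const auto byte = [&s](std::size_t i) {
        return static_cast<unsigned char>(s[i]);
    };
    const unsigned char lead = byte(pos);
    std::size_t extra = 0;          // number of continuation bytes
    std::uint32_t cp = lead;        // plain ASCII: the byte is the code point
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
    ++pos;
    // Each continuation byte contributes its low 6 bits.
    for (; extra > 0 && pos < s.size(); --extra, ++pos)
        cp = (cp << 6) | (byte(pos) & 0x3F);
    return cp;
}

// Collect all code points of a UTF-8 string.
std::vector<std::uint32_t> code_points(const std::string &s)
{
    std::vector<std::uint32_t> result;
    for (std::size_t pos = 0; pos < s.size();)
        result.push_back(next_code_point(s, pos));
    return result;
}
```

For "djv\xD3\xB7" "dio" this produces seven code points, the fourth being 0x4F7. Values below 128 can be appended as char, and anything else can be serialised as needed.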

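I implore you, though, to consider whether this is really what you want to do. Your output will be ambiguous: there is no way to tell whether a run of hex digits in it is literal text or the representation of some non-ASCII character. Why not stick with the original UTF-8 encoding, or use something like HTML entity encoding or quoted-printable? Those encodings are widely understood and widely supported.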
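As a rough illustration of the HTML entity alternative, the sketch below writes each non-ASCII code point as a hexadecimal numeric character reference. It takes already-decoded text as a std::u32string to stay short; the function name to_html_entities is an assumption, and only & is escaped among the ASCII characters so that references stay unambiguous.

```cpp
#include <sstream>
#include <string>

// Write non-ASCII code points as numeric character references (&#x...;).
// A literal '&' becomes "&amp;" so it cannot be mistaken for the start
// of a reference. Other HTML-special characters are left untouched.
std::string to_html_entities(const std::u32string &text)
{
    std::ostringstream out;
    out << std::hex << std::uppercase;
    for (char32_t cp : text)
    {
        if (cp == U'&')
            out << "&amp;";
        else if (cp < 0x80)
            out << static_cast<char>(cp);
        else
            out << "&#x" << static_cast<unsigned long>(cp) << ';';
    }
    return out.str();
}
```

With this sketch, U"djv\u04F7dio" becomes "djv&#x4F7;dio", which a reader can tell apart from a word that merely contains the letters and digits 4F7.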

Answer 2

I just solved the problem (the function writes each UTF-8 byte as a \x escape):

std::string Tools::encode_utf8(const std::wstring &wstr)
{
    std::string utf8_encoded;

    //iterate through the whole string
    for(size_t j = 0; j < wstr.size(); ++j)
    {
        if(wstr.at(j) <= 0x7F)
            utf8_encoded += wstr.at(j);
        else if(wstr.at(j) <= 0x7FF)
        {
            //our template for unicode of 2 bytes
            int utf8 = 0b11000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the last 5 remaining bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000111'11000000) << 2;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x"));
        }
        else if(wstr.at(j) <= 0xFFFF)
        {
            //our template for unicode of 3 bytes
            int utf8 = 0b11100000'10000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the last 4 remaining bits
             * put them 4 to the left so that the 10xx from 10xxxxxx (second byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b11110000'00000000) << 4;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x"));
        }
        else if(wstr.at(j) <= 0x10FFFF)
        {
            //our template for unicode of 4 bytes
            //unsigned, because this 32-bit pattern does not fit in a signed int
            unsigned int utf8 = 0b11110000'10000000'10000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the next 6 bits
             * put them 4 to the left so that the 10xx from 10xxxxxx (second byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000011'11110000'00000000) << 4;

            /*
             * get the last 3 remaining bits
             * put them 6 to the left so that the 10xxxx from 10xxxxxx (third byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00011100'00000000'00000000) << 6;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x").insert(12, "\\x"));
        }
    }
    return utf8_encoded;
}

2 answers

Answer 1: score 2, posted 2019-12-10 15:32:59
Answer 2: score 0, accepted, posted 2019-12-11 20:55:12
