How to convert a code point to UTF-8?

Question

I have some code that reads in a Unicode code point (given as an escape in a string, e.g. 0xF00).

Since I'm using Boost, I'm wondering whether the following is the best (and correct) approach:

unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);


Answer 1

You can do this with the standard library, using std::wstring_convert to convert UTF-32 (code points) to UTF-8:

#include <locale>
#include <codecvt>
#include <string>

std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}

This returns a std::string whose size is 1, 2, 3 or 4, depending on how large codepoint is. It throws a std::range_error if the code point is too large (> 0x10FFFF, the maximum Unicode code point).
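To make the 1-to-4-byte claim concrete, here is a minimal hand-written sketch of the UTF-8 bit layout. It is only an illustration of the encoding, not a description of what std::wstring_convert does internally. The function name encode_utf8_by_hand is made up for this example, and rejecting surrogates (0xD800 to 0xDFFF) is a choice of this sketch.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): encodes a single
// Unicode code point as UTF-8 by filling in the bit patterns by hand.
std::string encode_utf8_by_hand(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // Assumption of this sketch: surrogates are rejected, since they
        // are not valid on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::range_error("surrogate code point");
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::range_error("code point above 0x10FFFF");
    }
    return out;
}
```

With this sketch, the code point 0xF00 from the question falls into the three-byte range and encodes as the bytes E0 BC 80.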

Your version with Boost seems to do the same thing. The documentation says that the utf_to_utf function converts one UTF encoding to another, in this case UTF-32 to UTF-8. If you use char32_t, it will be a "correct" approach that also works on systems where unsigned int is not the same size as char32_t.

// Calling this with an unsigned int also converts the argument to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
    return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}

Answer 2

As mentioned, a code point in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.

For a solution that does not rely on functions deprecated since C++17, is not really ugly, and does not require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.

It's going to look something like this:

const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;

try
{
   utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
   // something
}

(There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)

(Since C++20, also consider reading into a vector<char8_t> or std::u8string.)

(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)

I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).

Answer 3

C++17 deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones are deprecated in C++20 (*). That being said, std::codecvt itself is still valid. From C++11 to C++17 you can use std::codecvt<char32_t, char, std::mbstate_t>; starting with C++20 it is std::codecvt<char32_t, char8_t, std::mbstate_t>.

Here is some code that converts a code point (up to 0x10FFFF) to UTF-8:

#include <cstddef>
#include <cwchar>
#include <locale>

// codepoint is the code point to convert
// buf is a char array of size sz (should be at least 4 to convert any code point)
// on return, sz is the number of bytes of buf used by the UTF-8 encoded result
// the return value is the return value of std::codecvt::out (std::codecvt_base::ok, i.e. 0, on success)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, size_t& sz) {
    std::locale loc("");
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
                   std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);

    // value-initialization gives the initial conversion state portably
    std::mbstate_t state{};

    const char32_t * last_in;
    char *last_out;
    std::codecvt_base::result res = cvt.out(state, &codepoint, 1+&codepoint, last_in,
                                            buf, buf+sz, last_out);
    sz = last_out - buf;
    return res;
}

(*) std::codecvt still exists in C++20. The difference is that the standard specializations to use are no longer std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t> (both deprecated in C++20) but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char).
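For callers who would rather get a std::string back than manage a buffer, the same facet can be wrapped as in the sketch below. The wrapper name codecvt_to_utf8 is hypothetical, and returning an empty string on failure is a choice of this sketch. It uses std::locale::classic() instead of std::locale(""), on the assumption that the locale does not matter here because this specialization is specified as converting between UTF-32 and UTF-8. It targets C++11 to C++17; under C++20 the char-based facet should still compile but is likely to produce deprecation warnings.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical wrapper around the char32_t -> char codecvt facet.
// Returns the UTF-8 bytes, or an empty string if the conversion fails.
std::string codecvt_to_utf8(char32_t codepoint) {
    using facet_t = std::codecvt<char32_t, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(std::locale::classic());

    std::mbstate_t state{};          // initial conversion state
    char buf[4];                     // a UTF-8 sequence is at most 4 bytes
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;

    const std::codecvt_base::result res =
        cvt.out(state, &codepoint, &codepoint + 1, from_next,
                buf, buf + sizeof buf, to_next);

    if (res != std::codecvt_base::ok)
        return std::string();
    return std::string(buf, to_next); // bytes written are [buf, to_next)
}
```

Called with the code point 0xF00 from the question, this should yield the three bytes E0 BC 80.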

Answer 4

After reading about the unsteady state of UTF-8 support in C++, I stumbled upon the corresponding C facility, c32rtomb, which looks promising and likely won't be deprecated any time soon:

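#include <clocale>
#include <cuchar>
#include <climits>
#include <string>

// buf must have room for at least MB_LEN_MAX bytes.
// Returns the number of bytes written, or (size_t)-1 if the conversion failed.
size_t to_utf8(char32_t codepoint, char *buf)
{
    // Query and copy the current locale name first: setlocale(LC_ALL, "...")
    // returns the NEW locale, and the returned pointer may be overwritten
    // by a later setlocale call.
    const std::string saved = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, "en_US.utf8");
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale
    return len;
}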

Usage would then be:

char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);

If your application's current locale is already UTF-8, you can of course omit the back-and-forth calls to setlocale.


4 answers

Answer 1
5 2019-05-28 11:57:07

Answer 2
4 (accepted) 2019-05-28 12:20:18

Answer 3
4 2019-05-28 13:30:42

Answer 4
1 2020-03-17 19:53:48
