How to use UTF-8 and Unicode in C++? How big is C++20's char8_t?

Question

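Suppose I want to store a single Unicode character in C++ (not inside a std::string). How should I do that? char8_t was introduced in C++20, but it seems to be just a typedef of unsigned char, which can hold at most 1 byte of information. Some characters (especially the more exotic ones, such as emoji) can take up to 4 bytes each.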

Example of code that does not work:

char8_t smth = "😀";

Interestingly, the following does work, even though sizeof() says it is 8 bytes big, which I find doubtful.

const char* smth = "😀";

Answer 1

Unicode vs UTF-8 vs UTF-32 vs char8_t vs char32_t

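Unicode is a standard that assigns every character a number, its code point. Code points range from 0 to 0x10FFFF, so they fit in 21 bits and are usually held in a 32-bit unsigned integer. By abuse of language we also say "a Unicode" when we mean a code point. For example, the Unicode (code point) of 😀 is 0x1F600.

UTF-32 is the simple encoding that stores each Unicode code point in 4 bytes (32 bits). It is simple because you just store the code point as a 32-bit unsigned integer.

UTF-8 is an encoding of Unicode code points that stores each of them in 1 to 4 blocks of 8 bits. This is possible because code points do not use all 32 bits, so the most common characters (roughly ASCII) can be represented with 1 byte (8 bits) and the less common ones with 2 to 4 bytes.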
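
To make the 1-to-4-byte rule concrete, here is a small compile-time sketch. The characters are written as universal character names (\u and \U escapes) so the result does not depend on how the source file is saved. utf8_length is a hypothetical helper name of mine, not a standard function.

```cpp
#include <cstddef>

// Number of UTF-8 code units (bytes) in a u8 string literal,
// not counting the terminating 0x00.
template <typename CharT, std::size_t N>
constexpr std::size_t utf8_length(CharT const (&)[N]) noexcept
{
    return N - 1;
}

static_assert(utf8_length(u8"A") == 1);          // U+0041, ASCII
static_assert(utf8_length(u8"\u00D6") == 2);     // U+00D6 (Ö)
static_assert(utf8_length(u8"\u20AC") == 3);     // U+20AC (€)
static_assert(utf8_length(u8"\U0001F600") == 4); // U+1F600 (😀)
```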

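char8_t is roughly an 8-bit unsigned integer. I say "roughly" for (at least) two reasons: first, the C++ standard only guarantees that it is at least 8 bits wide, so it may be larger if the compiler/platform so decides; second, it is a distinct type of its own and is not exactly the same as std::uint8_t (although converting from one to the other is trivial).

char32_t is similar to char8_t, except that it provides 32 bits (so it is roughly equivalent to std::uint32_t). That is convenient because you can use it to store one Unicode code point.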

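The case of char(8_t) const*

In C++, be careful when using C-strings (char(8_t) const*). They behave like a pointer, not like an object, so asking for their size returns the size of a pointer (8 on a 64-bit processor). It can look even sillier with the following code: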

char8_t const* str = u8"Hello";
sizeof(str); // == 8
sizeof(u"Hello"); // == 6 (5 letters + trailing 0x00)

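Using the proper string literals

Be careful when using char (or char const*, or std::string). An ordinary, unprefixed literal is encoded in whatever narrow encoding the compiler uses by default, which is often an extended-ASCII code page rather than UTF-8. So your compiler will not necessarily know what you are trying to write, and may not do what you expect.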

char c0 = '😀';             // = '?' on Visual Studio (with 3 warnings)
char8_t c1 = u8'😀';        // Compilation error: trying to store 4 char8_t in 1
char32_t c2 = U'😀';        // = 😀 (or 128512)

char const* s0 = "😀";      // = "??" on Visual Studio (with 1 warning)
char8_t const* s1 = u8"😀"; // = "😀" stored on 4 bytes (0xf0, 0x9f, 0x98, 0x80), or "ðŸ˜€"
char32_t const* s2 = U"😀"; // = "😀" stored like the 4 bytes unsigned integer 128512

sizeof("😀");               // = 3: 2 bytes for 😀 (not sure why) + 1 byte for 0x00
sizeof(u8"😀");             // = 5: 4 bytes for 😀 + 1 byte for 0x00
sizeof(U"😀");              // = 8: 4 bytes for 😀 + 4 bytes for 0x00
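
The results above also depend on the compiler reading the source file in the encoding you think it has. One way around that is to spell the character as a universal character name. The sketch below checks the resulting bytes at compile time; byte_at is a hypothetical helper, not part of any library.

```cpp
#include <cstddef>

// Value of the a_index-th code unit of a string literal, as an unsigned byte.
template <typename CharT, std::size_t N>
constexpr unsigned byte_at(CharT const (&a_literal)[N], std::size_t a_index) noexcept
{
    return static_cast<unsigned char>(a_literal[a_index]);
}

// U+1F600 written as an escape: no emoji needed in the source file.
static_assert(sizeof(u8"\U0001F600") == 5); // 4 bytes + 0x00
static_assert(byte_at(u8"\U0001F600", 0) == 0xF0);
static_assert(byte_at(u8"\U0001F600", 1) == 0x9F);
static_assert(byte_at(u8"\U0001F600", 2) == 0x98);
static_assert(byte_at(u8"\U0001F600", 3) == 0x80);

// The UTF-32 literal holds the code point itself.
static_assert(U'\U0001F600' == 0x1F600u);
static_assert(sizeof(U"\U0001F600") == 2 * sizeof(char32_t));
```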

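Storing a single Unicode code point / character

As Igor said, you can store 1 Unicode character by using a char32_t. However, if you want to store the code itself (the integer), you can store a std::uint32_t instead. The two representations differ both for the compiler and semantically, so be aware of which one you mean. Most of the time, char32_t is more explicit and less error-prone.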

char32_t c = U'😀';
std::uint32_t u = 0x1F600u; // it's funny because 'u' stands for unsigned here..
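
One thing to keep in mind is that a char32_t can hold values that are not valid code points. If that matters to you, a check along these lines may help. It is a sketch, and is_scalar_value is my own name for it.

```cpp
#include <cstdint>

// True if a_code is a Unicode scalar value: at most U+10FFFF and not a
// UTF-16 surrogate (U+D800..U+DFFF).
constexpr bool is_scalar_value(char32_t a_code) noexcept
{
    return a_code <= 0x10FFFF && !(a_code >= 0xD800 && a_code <= 0xDFFF);
}

constexpr char32_t smiley = U'\U0001F600';
// Explicit, lossless conversion from the character to its integer value.
constexpr std::uint32_t smiley_value = static_cast<std::uint32_t>(smiley);

static_assert(smiley_value == 0x1F600u);
static_assert(is_scalar_value(smiley));
static_assert(!is_scalar_value(static_cast<char32_t>(0x110000)));
```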

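Storing a string of Unicode characters

If you want to store a string of Unicode characters, however, you have several options. The first thing to find out is what the constraints of your program are, which other systems it interacts with, and so on.

Using char32_t

If you need to add/remove characters or inspect code points all the time (for example, if you need to draw characters on screen from a font), and, this is very important, if you have no strong memory constraints and you do not interface with (older) libraries that use plain strings to store UTF-8 characters, you can go with the UTF-32 representation by using char32_t: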

std::size_t size = sizeof(U"😀Ö"); // = 12 -> 4 bytes for each character including trailing 0x00

char32_t const* cString = U"😀Ö"; // sizeof(...) = 8 -> the size of a pointer

std::u32string string{ U"😀Ö" }; // .size() = 2

std::u32string_view stringView{ U"😀Ö" }; // .size() = 2
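
The practical benefit is that every element is one code point, so indexing and counting need no decoding. A small sketch (count_non_ascii is a hypothetical helper):

```cpp
#include <cstddef>
#include <string_view>

// In UTF-32 every element is one code point, so a plain loop over the
// elements visits code points directly.
constexpr std::size_t count_non_ascii(std::u32string_view a_string) noexcept
{
    std::size_t count = 0;
    for (char32_t const code : a_string)
    {
        if (code > 0x7F) { ++count; }
    }
    return count;
}

constexpr std::u32string_view text{ U"\U0001F600\u00D6!" }; // 😀, Ö, !

static_assert(text.size() == 3);           // 3 code points
static_assert(text[0] == U'\U0001F600');   // O(1) access by index
static_assert(text[1] == U'\u00D6');
static_assert(count_non_ascii(text) == 2);
```

Note that one code point is still not always one character as the user sees it (combining marks, for instance), so this only counts code points.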

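Using char8_t

If you are constrained by memory and cannot afford 32 bits for every code point (knowing that most of the time they will be ASCII characters, which UTF-8 encodes on just 8 bits), or if you have to interface with, for example, a library that uses char const* / std::string to store UTF-8 encoded characters, you can decide to store your strings encoded in UTF-8 by using char8_t: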

std::size_t size = sizeof(u8"😀Ö");
// = 7 -> 4 bytes for the emoji (they are pretty uncommon so UTF-8 encodes them on 4 bytes)
//   + 2 bytes for the "Ö" (not as uncommon but not a -very common- ASCII)
//   + 1 byte for the trailing 0x00

char8_t const* cString = u8"😀Ö"; // sizeof(...) = 8 -> the size of a pointer

std::u8string string{ u8"😀Ö" }; // .size() = 6 (string's size method doesn't count the 0x00)

std::u8string_view stringView{ u8"😀Ö" }; // .size() = 6

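The catch with char8_t is that, technically, your computer does not know the data is UTF-8 encoded (well, your compiler does, and it encodes the "Ö" for you). It only knows that you are storing 8-bit things that represent characters, which is why it does not return "2" when you ask for the size of these strings. If you need to know how many code points are represented (or how many characters you have to draw on screen), you need to decode the encoding. There are probably some fancy libraries that can do this for you, but here is what I personally use (I wrote it from the UTF-8 specification):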

// How many char8_t of this string you need to read to get 1 Unicode. The trick here 
// is that it can be done using only the first char8_t of the string because of how
// UTF-8 encoding works. However this won't check for following bytes that could be
// erroneous.
constexpr std::size_t code_size(std::u8string_view a_string) noexcept
{
    auto const h0 = a_string[0] & 0b11110000;
    return h0 < 0b10000000 ? 1 : (h0 < 0b11100000 ? 2 : (h0 < 0b11110000 ? 3 : 4));
}

// How many char8_t you need to add to a string to encode this Unicode with UTF-8.
constexpr std::size_t code_size(char32_t const a_code) noexcept
{
    return a_code <= 0x007f ? 1 : (a_code <= 0x07ff ? 2 : (a_code <= 0xffff ? 3 : 4));
}

// How many Unicode characters are stored in this UTF-8 encoded string.
constexpr std::size_t string_size(std::u8string_view a_string) noexcept
{
    std::size_t size = 0;
    while (!a_string.empty())
    {
        auto const codeSize = code_size(a_string);
        if (codeSize > a_string.size())
        {
            // Error: this is not a valid UTF-8 encoded string.
            return static_cast<std::size_t>(-1);
        }
        ++size; // One more character, however many bytes it took.
        a_string = a_string.substr(codeSize);
    }
    return size;
}

// Append the UTF-8 encoding of a code to an u8string.
template<typename TAllocator>
constexpr std::size_t write(
    char32_t a_code,
    std::basic_string<char8_t, std::char_traits<char8_t>, TAllocator>& a_output) // not noexcept: appending may allocate and throw
{
    if (a_code <= 0x007f)
    {
        a_output += static_cast<char8_t>(a_code);
        return 1;
    }
    else if (a_code <= 0x07ff)
    {
        a_output += static_cast<char8_t>(0b11000000 | ((a_code >> 6) & 0b00011111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 2;
    }
    else if (a_code <= 0xffff)
    {
        a_output += static_cast<char8_t>(0b11100000 | ((a_code >> 12) & 0b00001111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 6) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 3;
    }
    else
    {
        a_output += static_cast<char8_t>(0b11110000 | ((a_code >> 18) & 0b00000111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 12) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 6) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 4;
    }
}

// Read an Unicode from an UTF-8 encoded string view, effectively decreasing its size.
constexpr char32_t read(std::u8string_view& a_string)
{
    if (a_string.empty())
    {
        return 0x0000; // Null character
    }

    auto const codeSize = code_size(a_string);
    if (codeSize > a_string.size())
    {
        return 0xffff; // Invalid unicode
    }

    char8_t mask0 = codeSize < 2 ?
        0b1111111 : (codeSize < 3 ? 0b11111 : (codeSize < 4 ? 0b1111 : 0b111));
    char32_t unicode = mask0 & a_string[0];
    a_string = a_string.substr(1);

    constexpr char8_t mask = 0b00111111;
    for (auto i = 1u; i < codeSize; ++i)
    {
        if ((a_string[0] & ~mask) != 0b10000000)
        {
            return 0xffff; // Invalid unicode
        }
        unicode = (unicode << 6) | (mask & a_string[0]);
        a_string = a_string.substr(1);
    }
    
    return unicode;
}
