Unsigned integer as UTF-8 value

Question

Suppose I have

uint32_t a(3084);

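I want to create a string that stores the Unicode character U+3084. This means I should take the value of a and use it as the coordinate of the right character in the UTF-8 table/charset.

Now, obviously std::to_string() does not work for me. The standard has plenty of functions that convert between numeric values and char, but I cannot find one that supports UTF-8 and outputs a std::string.

I would like to ask whether I have to write this function from scratch, or whether something in the C++11 standard can help me with this. Please note that my compiler (gcc/g++ 4.8.1) does not fully support codecvt.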

Answer 1

Here is some C++ code that would not be hard to convert to C. It is adapted from an older answer.

#include <string>

// Encodes a single code point as a UTF-8 byte sequence (1 to 4 bytes).
std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;

    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
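The function above encodes whatever value it is given. As a minimal sketch, a stricter variant could reject values that are not Unicode scalar values (the surrogate range U+D800 to U+DFFF and anything above U+10FFFF). The name encode_utf8_checked and the choice to throw std::invalid_argument are assumptions made for illustration, not part of the answer.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one code point as UTF-8 and throws
// std::invalid_argument for surrogates and values above U+10FFFF.
std::string encode_utf8_checked(std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp <= 0x7F) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Note that U+3084 is written in hexadecimal, so the call for HIRAGANA LETTER YA would be encode_utf8_checked(0x3084), which yields the bytes e3 82 84. Passing the decimal literal 3084 encodes U+0C0C instead.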

Answer 2

std::wstring_convert::to_bytes has a single-character overload that fits this case.

#include <cstdint>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    uint32_t a(3084); // decimal 3084, i.e. code point U+0C0C

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv1;
    std::string u8str = conv1.to_bytes(a);
    std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
    hex_print(u8str);
}

This is what I get (using libc++):

$ ./test
UTF-8 conversion produced 3 bytes:
e0 b0 8c
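To check which code point those three bytes stand for, the bit layout of a three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx) can be reversed by hand. The helper below is a hypothetical sketch, not a general decoder: it assumes a three-byte lead byte and does not validate the continuation bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reassembles the code point from a three-byte UTF-8
// sequence. Only the lead-byte form is checked; no other validation is done.
std::uint32_t decode_utf8_3byte(unsigned char b0, unsigned char b1, unsigned char b2)
{
    assert((b0 & 0xF0) == 0xE0);  // precondition: lead byte of a 3-byte sequence
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)   // top 4 bits
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)    // middle 6 bits
         |  static_cast<std::uint32_t>(b2 & 0x3F);         // low 6 bits
}
```

Under this sketch, e0 b0 8c decodes to 0x0C0C, which is 3084 in decimal, so the output matches the decimal literal in the code. The character the question asks about, U+3084, would come out as e3 82 84, which requires writing the value as 0x3084.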

Answer 3

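The C++ standard contains the std::codecvt<char32_t, char, mbstate_t> facet, which converts between UTF-32 and UTF-8 according to 22.4.1.4 [locale.codecvt] paragraph 3. Sadly, the std::codecvt<...> facets are not easy to use. At some point there was discussion of filtering stream buffers that would handle code conversion (the standard C++ library has to implement them anyway for std::basic_filebuf<...>), but I cannot see any trace of them.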
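As a rough sketch of what using the facet directly can look like, the function below fetches it from the classic locale and converts a single code point. It assumes a standard library that actually ships this specialization (the asker's g++ 4.8.1 reportedly does not fully support it), and the specialization is deprecated as of C++20. The name to_utf8_via_facet and the empty-string error result are choices made for this example.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: converts one UTF-32 code point to UTF-8 through the
// std::codecvt<char32_t, char, std::mbstate_t> facet of the classic locale.
// Returns an empty string if the facet reports anything other than success.
std::string to_utf8_via_facet(char32_t cp)
{
    using Facet = std::codecvt<char32_t, char, std::mbstate_t>;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    std::mbstate_t state{};               // initial conversion state
    const char32_t* from_next = nullptr;  // set by out(): first unconverted input
    char buf[4];                          // UTF-8 needs at most 4 bytes per code point
    char* to_next = nullptr;              // set by out(): one past the last byte written

    const std::codecvt_base::result r =
        facet.out(state, &cp, &cp + 1, from_next, buf, buf + 4, to_next);
    if (r != std::codecvt_base::ok)
        return std::string();
    return std::string(buf, to_next);
}
```

The amount of bookkeeping needed for a single character (state object, two output pointers, result code) illustrates why the facet interface is awkward to call by hand.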

Answer 4

auto s = u8"\343\202\204"; // Octal escaped representation of HIRAGANA LETTER YA
std::cout << s << std::endl;

prints

や

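for me (using g++ 4.8.1). As you would expect, the type of s is const char*, but I do not know whether this is implementation-defined. Unfortunately, as far as I know, C++ has no support for manipulating UTF-8 strings. For that you need a library such as Glib::ustring.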
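To illustrate the gap: std::string treats the text as raw bytes, so size() reports 3 for this one character. A minimal sketch of counting code points by hand, assuming the input is already valid UTF-8, is to skip continuation bytes (those of the form 10xxxxxx). The name count_code_points is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed to be
// valid UTF-8 by counting every byte that is not a continuation byte.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80)  // 10xxxxxx marks a continuation byte
            ++count;
    }
    return count;
}
```

For the three-byte string "\xe3\x82\x84" this returns 1 while size() returns 3. Anything beyond simple counting (validation, normalization, grapheme clusters) is where a dedicated library becomes worthwhile.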

4 answers

Answer 1
7 Accepted 2013-11-14 03:30:29

Answer 2
5 2013-11-14 03:38:23

Answer 3
1 2013-11-14 03:21:58

Answer 4
0 2013-11-14 03:19:15
