UTF-16 conversion failing with utfcpp

Question

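In the code below I use utfcpp to convert a UTF-16 encoded file into a UTF-8 string.

I think I must be using it incorrectly, because the result is unchanged. The utf8content variable ends up with a null character (\0) at every other position, just like the UTF-16 I put into it.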

//get file content
string utf8content;
std::ifstream ifs(path);
vector<unsigned short> utf16line((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

//convert
if(!utf8::is_valid(utf16line.begin(), utf16line.end())){
    utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8content));
}

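I found the place in the library where the appending happens. It treats everything that fits in the first octet the same way, whereas I think 0 should be handled differently.

The append method (line 106) comes from checked.h. It is called by utf16to8 (line 202). Note that I added the first branch of the if so that null characters are skipped, in an attempt to work around the problem.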

template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result)
{
    if (!utf8::internal::is_code_point_valid(cp))
        throw invalid_code_point(cp);

    if(cp < 0x01)                 //<===I added this line and..
        *(result++);              //<===I added this line
    else if (cp < 0x80)                        // one octet
        *(result++) = static_cast<uint8_t>(cp);
    else if (cp < 0x800) {                // two octets
        *(result++) = static_cast<uint8_t>((cp >> 6)            | 0xc0);
        *(result++) = static_cast<uint8_t>((cp & 0x3f)          | 0x80);
    }
    else if (cp < 0x10000) {              // three octets
        *(result++) = static_cast<uint8_t>((cp >> 12)           | 0xe0);
        *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f)   | 0x80);
        *(result++) = static_cast<uint8_t>((cp & 0x3f)          | 0x80);
    }
    else {                                // four octets
        *(result++) = static_cast<uint8_t>((cp >> 18)           | 0xf0);
        *(result++) = static_cast<uint8_t>(((cp >> 12) & 0x3f)  | 0x80);
        *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f)   | 0x80);
        *(result++) = static_cast<uint8_t>((cp & 0x3f)          | 0x80);
    }
    return result;
}

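I can't imagine that the solution is simply to strip null characters from the string; why wouldn't the library already handle this? Clearly I am doing something wrong.

So my question is: what is wrong with the way I use utfcpp in the first snippet? Am I getting a type conversion wrong?

My content is a UTF-16 encoded XML file. The result appears to be truncated at the first null character.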

Answer 1

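std::ifstream reads the file in 8-bit char units, whereas UTF-16 uses 16-bit units. So if you want to read the file and fill the vector with proper UTF-16 code units, use std::wifstream instead (or std::basic_ifstream<char16_t> or an equivalent if wchar_t is not 16 bits on your platform).

Also, do not call utf8::is_valid() here. It expects UTF-8 input, but what you have is UTF-16.

If sizeof(wchar_t) is 2: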

std::wifstream ifs(path);
std::istreambuf_iterator<wchar_t> ifs_begin(ifs), ifs_end;
std::wstring utf16content(ifs_begin, ifs_end);
std::string utf8content;

try {
    utf8::utf16to8(utf16content.begin(), utf16content.end(), std::back_inserter(utf8content));
}
catch (const utf8::invalid_utf16 &) {
    // bad UTF-16 data!
}

Otherwise:

// if char16_t is not available, use uint16_t or unsigned short instead

std::basic_ifstream<char16_t> ifs(path);
std::istreambuf_iterator<char16_t> ifs_begin(ifs), ifs_end;
std::basic_string<char16_t> utf16content(ifs_begin, ifs_end);
std::string utf8content;

try {
    utf8::utf16to8(utf16content.begin(), utf16content.end(), std::back_inserter(utf8content));
}
catch (const utf8::invalid_utf16 &) {
    // bad UTF-16 data!
}
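
Whether a wide stream really delivers one 16-bit unit per two bytes of the file depends on the locale imbued in the stream and its codecvt facet. As a hedged alternative, the sketch below reads the file as raw bytes in binary mode and pairs the bytes by hand. The helper names bytes_to_utf16_units and read_utf16_file are hypothetical and not part of utfcpp. The sketch also assumes little-endian data when no byte order mark is present.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: combine raw bytes into 16-bit code units.
// A leading BOM (FF FE or FE FF) selects the byte order and is dropped.
// Without a BOM, little-endian is assumed (an assumption of this sketch).
// A trailing odd byte, if any, is ignored.
std::vector<std::uint16_t> bytes_to_utf16_units(const std::string& bytes)
{
    std::size_t pos = 0;
    bool little_endian = true;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) {
            pos = 2;                      // little-endian BOM
        } else if (b0 == 0xFE && b1 == 0xFF) {
            little_endian = false;        // big-endian BOM
            pos = 2;
        }
    }

    std::vector<std::uint16_t> units;
    units.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned int first = static_cast<unsigned char>(bytes[pos]);
        const unsigned int second = static_cast<unsigned char>(bytes[pos + 1]);
        const unsigned int unit = little_endian ? (first | (second << 8))
                                                : ((first << 8) | second);
        units.push_back(static_cast<std::uint16_t>(unit));
    }
    return units;
}

// Hypothetical helper: read the whole file as raw bytes, then pair them.
std::vector<std::uint16_t> read_utf16_file(const std::string& path)
{
    std::ifstream ifs(path.c_str(), std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(ifs)),
                            std::istreambuf_iterator<char>());
    return bytes_to_utf16_units(bytes);
}
```

The returned vector can then be passed to utf8::utf16to8 with begin(), end() and std::back_inserter(utf8content), inside the same try/catch as above.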

Answer 2

The problem is where you read the file:

vector<unsigned short> utf16line((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

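This line takes a char iterator and uses it to fill the vector one byte at a time. You are effectively casting each byte to an unsigned short rather than reading two bytes at a time.

This splits each UTF-16 code unit into two parts, and for most of your input one of those two parts will be a null byte.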
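
The effect can be reproduced without a file. The minimal sketch below builds the vector the same way the question does, from a range of char. The function name widen_each_byte is hypothetical and used only for illustration.

```cpp
#include <string>
#include <vector>

// Builds the vector the same way as the question: from a range of char.
// Each byte becomes its own unsigned short element; no bytes are paired.
std::vector<unsigned short> widen_each_byte(const std::string& bytes)
{
    return std::vector<unsigned short>(bytes.begin(), bytes.end());
}
```

For the four little-endian UTF-16 bytes 3C 00 61 00 (the text "<a"), this yields four elements, 0x3C, 0, 0x61, 0, instead of the two code units 0x003C and 0x0061. Each zero element is a valid code point (U+0000) that utf16to8 encodes as a single null byte, which matches the nulls seen in utf8content. Note also that where char is signed, bytes of 0x80 and above are sign-extended during the conversion.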

Answer 1: 2, accepted, 2014-01-17 23:11:40

Answer 2: 1, 2014-01-17 22:45:43
