
UTF16 conversion failing with utfcpp

I have written the code below, which uses utfcpp to convert a UTF-16 encoded file to a UTF-8 string.

I think I must be using it improperly, because the result isn't changing. The utf8content variable comes out with a null character (\0) at every other position, exactly like the UTF-16 that I put into it.

//get file content
string utf8content;
std::ifstream ifs(path);
vector<unsigned short> utf16line((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

//convert
if(!utf8::is_valid(utf16line.begin(), utf16line.end())){
    utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8content));
}

I found the location in the library that does the append. It treats everything in the one-octet range the same way, whereas my thought is that it should handle 0's differently.

Here is the append method from checked.h (line 106). It is called by utf16to8 (line 202). Notice that I added the first part of the if, so that it skips the null chars, in an attempt to fix the problem.

template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result)
{
    if (!utf8::internal::is_code_point_valid(cp))
        throw invalid_code_point(cp);

    if(cp < 0x01)                 //<===I added this line and..
        *(result++);              //<===I added this line
    else if (cp < 0x80)                        // one octet
        *(result++) = static_cast<uint8_t>(cp);
    else if (cp < 0x800) {                // two octets
        *(result++) = static_cast<uint8_t>((cp >> 6)            | 0xc0);
        *(result++) = static_cast<uint8_t>((cp & 0x3f)          | 0x80);
    }
    else if (cp < 0x10000) {              // three octets
        *(result++) = static_cast<uint8_t>((cp >> 12)           | 0xe0);
        *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f)   | 0x80);
        *(result++) = static_cast<uint8_t>((cp & 0x3f)          | 0x80);
    }
    else {                                // four octets
        *(result++) = static_cast<uint8_t>((cp >> 18)           | 0xf0);
        *(result++) = static_cast<uint8_t>(((cp >> 12) & 0x3f)  | 0x80);
        *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f)   | 0x80);
        *(result++) = static_cast<uint8_t>((cp & 0x3f)          | 0x80);
    }
    return result;
}

I can't imagine that simply removing the null chars from the string is the solution, though. If it were, why wouldn't the library already do this? So clearly I'm doing something wrong.

So, my question is: what is wrong with the way I'm using utfcpp in the first bit of code? Is there some type conversion that I've done wrong?

My content is a UTF-16 encoded XML file. The result seems to be truncated at the first null character.

std::ifstream reads the file in 8-bit char units. UTF-16 uses 16-bit units instead. So if you want to read the file and fill your vector with proper UTF-16 units, use std::wifstream instead (or std::basic_ifstream<char16_t> or equivalent if wchar_t is not 16-bit on your platform).

And do not call utf8::is_valid() here. It expects UTF-8 input, but you have UTF-16 input instead.

If sizeof(wchar_t) is 2:

std::wifstream ifs(path);
std::istreambuf_iterator<wchar_t> ifs_begin(ifs), ifs_end;
std::wstring utf16content(ifs_begin, ifs_end);
std::string utf8content;

try {
    utf8::utf16to8(utf16content.begin(), utf16content.end(), std::back_inserter(utf8content));
}
catch (const utf8::invalid_utf16 &) {
    // bad UTF-16 data!
}

Otherwise:

// if char16_t is not available, use uint16_t or unsigned short instead

std::basic_ifstream<char16_t> ifs(path);
std::istreambuf_iterator<char16_t> ifs_begin(ifs), ifs_end;
std::basic_string<char16_t> utf16content(ifs_begin, ifs_end);
std::string utf8content;

try {
    utf8::utf16to8(utf16content.begin(), utf16content.end(), std::back_inserter(utf8content));
}
catch (const utf8::invalid_utf16 &) {
    // bad UTF-16 data!
}
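How a wide stream turns file bytes into wide characters is governed by the codecvt facet of the imbued locale, so the stream-based snippets above may need a suitable locale on some platforms. As an alternative, here is a minimal sketch that reads the file as raw bytes and pairs them into 16-bit units by hand. The helper names read_all_bytes and bytes_to_utf16 are hypothetical (they are not part of utfcpp), and the sketch assumes the data is little-endian unless a byte order mark says otherwise.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes (binary mode, no conversion).
std::string read_all_bytes(const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(ifs)),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: pair up raw bytes into UTF-16 code units.
// A leading byte order mark selects the byte order and is dropped;
// without one, little-endian is assumed (an assumption of this sketch).
std::u16string bytes_to_utf16(const std::string& bytes) {
    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; pos = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { pos = 2; }
    }

    std::u16string units;
    units.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned hi = static_cast<unsigned char>(bytes[big_endian ? pos : pos + 1]);
        const unsigned lo = static_cast<unsigned char>(bytes[big_endian ? pos + 1 : pos]);
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;  // a trailing odd byte, if any, is ignored
}
```

The resulting string can then be handed to utf8::utf16to8(units.begin(), units.end(), std::back_inserter(utf8content)) inside the same try/catch as above.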

The problem is where you're reading the file:

vector<unsigned short> utf16line((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

This line takes a char iterator and uses it to fill the vector one byte at a time. You're essentially casting each byte to a 16-bit value instead of reading two bytes at a time.

This breaks each UTF-16 code unit into two pieces, and for much of your input one of those two pieces will be a null byte.
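The effect can be reproduced without a file. The sketch below wraps the same vector construction in a hypothetical function, read_like_question, and feeds it from any std::istream. It is only an illustration of the behaviour described above, not a fix.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Mirrors the construction from the question: every char read from the
// stream becomes its own vector element, so two bytes yield two elements
// rather than one 16-bit unit.
std::vector<unsigned short> read_like_question(std::istream& in) {
    return std::vector<unsigned short>((std::istreambuf_iterator<char>(in)),
                                       std::istreambuf_iterator<char>());
}
```

For example, the four bytes 3C 00 61 00 (the text "<a" in little-endian UTF-16) wrapped in a std::istringstream come back as four elements, 0x3C, 0x00, 0x61, 0x00, instead of the two code units 0x003C and 0x0061.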

