C++ UTF-16 to char conversion (Linux/Ubuntu)

I am trying to help a friend with a project that was supposed to take an hour and has now taken three days. Needless to say, I feel very frustrated and angry ;-) ooooouuuu... I breathe.

So the program, written in C++, just reads a bunch of files and processes them. The problem is that the files use UTF-16 encoding (because they contain words written in different languages), and a simple use of ifstream just doesn't seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files were in UTF-16.

Now I have spent literally the whole afternoon on the web trying to find information about READING UTF-16 files and converting the content of a UTF-16 line to char! I just can't seem to! It's a nightmare. I am trying to learn about <locale> and <codecvt>, wstring, etc., which I have never used before (I specialise in graphics apps, not desktop apps). I just can't get it.

This is what I have done so far (but it doesn't work):

std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

That's the most I could come up with, but it doesn't even work. And it doesn't do anything better. But the problem is that I don't understand what I am doing in the first place anyway.

SO PLEASE PLEASE HELP! It is really driving me crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++), and this code needs -stdlib=libc++, which doesn't seem to be supported by gcc on his side (even though he uses a pretty advanced version of gcc, 4.6.3 I believe). So I am not even sure using codecvt and locale is a good idea (as in "possible"). Would there be a better (or another) option?

If I convert all the files to UTF-8 just from the command line (using a Linux command), am I going to potentially lose information?

Thanks a lot, I will be forever grateful to you if you help me with this.

If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.
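As a rough illustration of why nothing is lost, here is a minimal hand-written sketch that turns the raw bytes of a UTF-16 file into UTF-8 without using `<codecvt>`. The names `append_utf8` and `utf16_bytes_to_utf8` are made up for this example. It assumes little-endian data unless a big-endian BOM is present, silently ignores a trailing odd byte, and replaces unpaired surrogates with U+FFFD rather than reporting an error.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point to `out` as UTF-8 (1 to 4 bytes).
static void append_utf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert the raw bytes of a UTF-16 file to UTF-8.
// Assumption: little-endian unless the data starts with the BOM FE FF.
std::string utf16_bytes_to_utf8(const std::string &bytes)
{
    std::size_t i = 0;
    bool big_endian = false;

    // Combine two bytes into one 16-bit code unit, honouring the byte order.
    auto unit_at = [&](std::size_t pos) -> std::uint32_t {
        unsigned b0 = static_cast<unsigned char>(bytes[pos]);
        unsigned b1 = static_cast<unsigned char>(bytes[pos + 1]);
        return big_endian ? ((b0 << 8) | b1) : ((b1 << 8) | b0);
    };

    // Consume a byte order mark if there is one.
    if (bytes.size() >= 2) {
        unsigned b0 = static_cast<unsigned char>(bytes[0]);
        unsigned b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; i = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { i = 2; }
    }

    std::string out;
    while (i + 1 < bytes.size()) {
        std::uint32_t cp = unit_at(i);
        i += 2;
        // A high surrogate followed by a low surrogate encodes one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            std::uint32_t low = unit_at(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        // Anything still in the surrogate range was unpaired.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        append_utf8(out, cp);
    }
    return out;
}
```

Every UTF-16 code unit or surrogate pair maps to exactly one code point, and every code point has a UTF-8 encoding, which is why the conversion cannot drop information for well-formed input.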


When wide characters were introduced, they were intended to be a text representation used exclusively internally by a program, and never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out to narrow characters in the output file, and converting narrow characters in a file to wide characters in memory when reading.

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ascii in the file is converted to wide characters.

Of course the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing, and from char to wchar_t when reading.
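To see that conversion happen, a small sketch like the following writes a wide string through a `std::wofstream` and then reads the file back as raw bytes. The helper name `bytes_written_by_wide_stream` and the scratch file path are invented for this example, and it assumes the program is still running with the default "C" locale and that the text is plain ASCII.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: write `text` through a wide stream, then return the
// raw bytes that ended up in the file. The scratch file is deleted afterwards.
std::string bytes_written_by_wide_stream(const std::wstring &text, const char *path)
{
    {
        std::wofstream wout(path, std::ios::binary);
        wout << text;  // wchar_t -> char conversion is done by the codecvt facet
    }                  // leaving the scope closes and flushes the file

    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    in.close();
    std::remove(path);
    return bytes;
}
```

Calling `bytes_written_by_wide_stream(L"Hello", "wide_stream_demo.txt")` should give back the five bytes of "Hello", not five wchar_t values, even though wchar_t is four bytes wide on Linux.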


However, since some people started writing files out in UTF-16, other people have just had to deal with it. The way they do that with C++ streams is by creating codecvt facets that will treat char as holding half a UTF-16 code unit, which is what codecvt_utf16 does.

So with that explanation, here are the problems with your code:

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}

Here's one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}
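On the side remark that testing eof() is not the usual way to write a getline loop: the sketch below compares the two loop shapes on an in-memory stream. It uses narrow strings and `std::istringstream` purely to keep the example small, and the function names are made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads lines the way the question's loop does: test eof() first, then read.
std::vector<std::string> read_lines_eof_loop(const std::string &text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    while (!in.eof()) {
        std::string line;
        std::getline(in, line);  // may fail, but the result is used anyway
        lines.push_back(line);
    }
    return lines;
}

// Reads lines with the usual idiom: the read itself is the loop condition.
std::vector<std::string> read_lines_getline_loop(const std::string &text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

For the input "a\nb\n" the eof() version returns three entries, the last one empty, because eof is only set once a read has actually run into the end of the stream. The getline version returns the two real lines. The eof() version also never terminates if the file failed to open, since failbit is set but eofbit never is.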

I adapted, corrected and tested Mats Petersson's impressive solution.

int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    // != binds tighter than &, so the mask needs its own parentheses
    if ((t & 0xFC00) != 0xD800)
    {
        return t;
    }
    int charcode = (coded[1] & 0x3FF); // | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}



#ifdef __cplusplus    // If used by C++ code,
extern "C" {          // we need to export the C interface
#endif
void convert_utf16_to_utf32(UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
    const UTF16 * const end = input + input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        std::vector<int> vec; // endianness
        vec.push_back(U16_LEAD(uc) & 0xFF);
        printf("LEAD + %.4x\n", U16_LEAD(uc) & 0x00FF);
        vec.push_back(U16_TRAIL(uc) & 0xFF);
        printf("TRAIL + %.4x\n", U16_TRAIL(uc) & 0x00FF);
        *output++ = utf16_to_utf32(vec);
    }
}
#ifdef __cplusplus
}
#endif

UTF-8 is capable of representing all valid Unicode characters (code-points), which is better than UTF-16 (which covers the first 1.1 million code-points). [Although, as the comment explains, there are no valid Unicode code-points beyond the 1.1 million value, so UTF-16 is "safe" for all currently available code-points, and probably for a long time to come, unless we do get extraterrestrial visitors that have a very complex written language...]

It does this by, when necessary, using multiple bytes/words to store a single code-point (what we'd call a character). In UTF-8, this is marked by the highest bit being set: in the first byte of a "multibyte" character, the top two bits are set, and in the following byte(s) the top bit is set and the next bit from the top is zero.
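Those bit patterns are easy to test for directly. The following is a minimal sketch with function names invented for this example. It assumes the input is already well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Continuation byte: top bit set, next bit clear (bit pattern 10xxxxxx).
bool is_utf8_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// First byte of a multibyte character: top two bits set (bit pattern 11xxxxxx).
bool is_utf8_multibyte_lead(unsigned char byte)
{
    return (byte & 0xC0) == 0xC0;
}

// Count code points by counting every byte that is not a continuation byte.
// Assumption: `utf8` is well-formed; malformed input gives a meaningless count.
std::size_t count_utf8_code_points(const std::string &utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if (!is_utf8_continuation(byte)) {
            ++count;
        }
    }
    return count;
}
```

For instance, the two bytes C3 A9 (the letter é) consist of one lead byte and one continuation byte, so they count as a single code point.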

To convert an arbitrary code-point to UTF-8, you can use the code in a previous answer from me. (Yes, that question talks about the reverse of what you are asking for, but the code in my answer covers both directions of conversion.)

Converting from UTF-16 to "integer" uses a similar method, except for the length of the input. If you are lucky, you can perhaps even get away with not doing that...

UTF-16 uses the range D800-DBFF for the first item, which holds 10 bits of data, and the following item is in the range DC00-DFFF, holding the next 10 bits of data.

Code for 16-bit to follow...

Code for 16-bit to 32-bit conversion (I have only tested this a little bit, but it appears to work OK):
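A worked example of that split, as a sketch with helper names made up for illustration: take U+1F600, subtract 0x10000 to get the 20-bit value 0xF600, put its top 10 bits (0x03D) into the first item and its bottom 10 bits (0x200) into the second.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above U+FFFF into its two UTF-16 items.
std::pair<std::uint16_t, std::uint16_t> to_surrogate_pair(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 bits of payload
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 | (v >> 10));    // top 10 bits
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}

// Hypothetical helper: the inverse, assuming `high` and `low` form a valid pair.
std::uint32_t from_surrogate_pair(std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(high) & 0x3FF) << 10)
         + (static_cast<std::uint32_t>(low) & 0x3FF);
}
```

With these, `to_surrogate_pair(0x1F600)` gives D83D and DE00, and feeding that pair to `from_surrogate_pair` gives 0x1F600 back.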

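#include <cstdlib>
#include <iostream>
#include <vector>

std::vector<int> utf32_to_utf16(int charcode)
{
    std::vector<int> r;
    if (charcode < 0x10000)
    {
        // D800-DFFF is reserved for surrogates and is not a valid character.
        // The mask needs parentheses because == binds tighter than &.
        if ((charcode & 0xF800) == 0xD800)
        {
            std::cerr << "Error bad character code" << std::endl;
            exit(1);
        }
        r.push_back(charcode);
        return r;
    }
    charcode -= 0x10000;
    if (charcode > 0xFFFFF)
    {
        std::cerr << "Error bad character code" << std::endl;
        exit(1);
    }
    // First item: top 10 bits; second item: bottom 10 bits.
    int coded = 0xD800 | ((charcode >> 10) & 0x3FF);
    r.push_back(coded);
    coded = 0xDC00 | (charcode & 0x3FF);
    r.push_back(coded);
    return r;
}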


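int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    // Not a first-half surrogate: the item is the code-point itself.
    // The mask needs parentheses because != binds tighter than &.
    if ((t & 0xFC00) != 0xD800)
    {
        return t;
    }
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}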
