Handling UTF-8 in C++

To find out whether C++ is the right language for a project of mine, I want to test its UTF-8 capabilities. Following some references, I built this example:

#include <string>
#include <iostream>

using namespace std;

int main() {
    wstring str;
    while(getline(wcin, str)) {
        wcout << str << endl;
        if(str.empty()) break;
    }

    return 0;
}

But when I type in a UTF-8 character, it misbehaves:

$ > ./utf8 
Hello
Hello
für
f
$ >

Not only does it fail to print the ü, it also quits immediately. gdb told me there was no crash, just a normal exit, yet I find that hard to believe.

Don't use wstring on Linux.

std::wstring VS std::string

Take a look at the first answer. I'm sure it answers your question.

  1. When should I use std::wstring over std::string?

On Linux? Almost never (§).

On Windows? Almost always (§).

The language itself has nothing to do with Unicode or any other character encoding. That is tied to the operating system. Windows uses UTF-16 for Unicode support, which implies using wide characters (16-bit wide chars): wchar_t or std::wstring. The Unicode (W) versions of the Win API functions that operate on strings require wide char input.

But Unix-based systems, e.g. Mac OS X or Linux, use UTF-8. Of course, it is only a matter of how you handle the bytes in the array, so you can store a UTF-16 string in a plain C array or a std::string container. This is why you do not see any wstrings in cross-platform code; instead all strings are handled as UTF-8 and re-encoded to UTF-16 when necessary (on Windows).

There are several ways to handle this somewhat confusing topic. I personally do it as mentioned above: strictly using UTF-8 throughout the application, re-encoding strings when interacting with the Windows API and using them directly on Mac OS X. For the Windows re-encoding I use these great conversion helpers:

C++ UTF-8 Conversion Helpers (on MSDN, available under the Apache License, Version 2.0).

You can also use the cross-platform Qt string class (QString), which defines conversion functions between UTF-8 and UTF-16 as well as other encodings (ANSI, Latin...).
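For readers who want to see what such a re-encoding step involves, the following is a hand-written sketch, not the MSDN helpers or Qt's implementation. The function name utf8_to_utf16 is hypothetical, and the decoder is deliberately simplified: it rejects truncated or malformed sequences but does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-16 code units.
// Throws std::invalid_argument on truncated or malformed sequences.
// Simplification: overlong encodings and encoded surrogates are not rejected.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The sketch returns std::u16string so that it behaves the same on every platform. On Windows, where wchar_t is 16 bits wide, the same code units could be copied into a std::wstring before calling the wide API; in production code a tested library or the platform's own conversion routine is likely the safer choice.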

So the answer above holds: on Unix always use UTF-8 (std::string, char); on Windows use UTF-16 (std::wstring, wchar_t).

Remember that on startup of the main program, the "C" locale is selected by default. You probably don't want this if you handle UTF-8. Calling setlocale(LC_CTYPE, "") turns off this default, and you get whatever is defined in the environment (presumably a UTF-8 locale).
