
C++ Visual Studio character encoding issues

Not being able to wrap my head around this one is a real source of shame...

I'm working with a French version of Visual Studio (2008), in a French Windows (XP). French accents put in strings sent to the output window get corrupted. Ditto input from the output window. Typical character encoding issue: I enter ANSI, get UTF-8 in return, or something to that effect. What setting can ensure that the characters remain in ANSI when showing a "hardcoded" string to the output window?

EDIT:

Example:

#include <iostream>

int main()
{
    std::cout << "àéêù" << std::endl;

    return 0;
}

Will show in the output:

óúÛ¨

(here encoded as HTML for your viewing pleasure)

I would really like it to show:

àéêù

Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in 2.2 what character sets are valid in source code. It ain't much in there, and all the characters used are in ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US locale machine).

To start with, you have 4 chars on your cout line, and 4 glyphs in the output. So the issue is not one of UTF-8 encoding, as that would combine multiple source chars into fewer glyphs.

From your source string to the display on the console, all of these things play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph.

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in, and decodes it to its internal representation. It generates the data chunk corresponding to the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details/control on this.

3 is even easier. Except for control codes, << just passes the data down for char *.

4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also figure out which one you have with GetConsoleOutputCP (input is controlled differently, through SetConsoleCP).
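
As a minimal sketch of what that looks like (Windows-only; the codepage value 1252 is just an example), you can query and set the console output codepage like this:

#include <windows.h>
#include <iostream>

int main()
{
    // Query the current console output codepage (e.g. 850 or 1252).
    std::cout << "Console output codepage: " << GetConsoleOutputCP() << std::endl;

    // Switch output to Windows-1252 (Western European).
    // SetConsoleOutputCP returns zero on failure.
    if (!SetConsoleOutputCP(1252))
        std::cout << "SetConsoleOutputCP failed" << std::endl;

    return 0;
}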

5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly, using CP1252 (Western European, Windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (capital Theta, the same I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).

Some interesting things I learned looking at this:

  • the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0)
  • VC is picking a codepage for the string literals that I do not seem to control.
  • controlling what the console shows is more painful than what I was expecting

So... what does this mean to you? Here are bits of advice:

  • don't use non-ASCII in string literals. Use resources, where you control the encoding.
  • make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send.
  • if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0]; does show 233 for me, which happens to be the encoding in CP1252 (see the sketch just below this list).
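
As a minimal, self-contained version of that diagnostic (assuming a narrow execution charset, as in this answer):

#include <iostream>

int main()
{
    // Print each byte of the literal as an integer to see how the
    // compiler actually encoded it (233 == 0xE9 == 'é' in CP1252).
    const char* a = "é";
    for (const char* p = a; *p != '\0'; ++p)
        std::cout << (unsigned int)(unsigned char)*p << ' ';
    std::cout << std::endl;

    return 0;
}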

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.

Try this:

#include <iostream>
#include <locale>

int main()
{
    std::locale::global(std::locale(""));
    std::cout << "àéêù" << std::endl;

    return 0;
}

Because I was requested to, I'll do some necromancy. The other answers were from 2009, but this article still came up in a search I did in 2018. The situation today is very different. Also, the accepted answer was incomplete even back in 2009.

The Source Character Set

Every compiler (including Microsoft's Visual Studio 2008 and later, gcc, clang and icc) will read UTF-8 source files that start with a BOM without a problem, and clang will not read anything but UTF-8, so UTF-8 with a BOM is the lowest common denominator for C and C++ source files.

The language standard doesn't say what source character sets the compiler needs to support. Some real-world source files are even saved in a character set incompatible with ASCII. Microsoft Visual C++ in 2008 supported UTF-8 source files with a byte order mark, as well as both forms of UTF-16. Without a byte order mark, it would assume the file was encoded in the current 8-bit code page, which was always a superset of ASCII.
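
For reference, the UTF-8 byte order mark is the three-byte sequence 0xEF 0xBB 0xBF at the start of the file. A minimal sketch of detecting it (the path argument is a placeholder):

#include <cstdio>

// Returns true if the file at 'path' starts with the UTF-8 byte order mark.
bool has_utf8_bom(const char* path)
{
    unsigned char bom[3] = {0, 0, 0};
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::size_t n = std::fread(bom, 1, 3, f);
    std::fclose(f);
    return n == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}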

The Execution Character Sets

In 2012, the compiler added a /utf-8 switch to CL.EXE. Today, it also supports the /source-charset and /execution-charset switches, as well as /validate-charset to detect if your file is not actually UTF-8. This page on MSDN has a link to the documentation on Unicode support for every version of Visual C++.
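
For example (the file name is just a placeholder), either of these command lines tells a recent CL.EXE to treat both the source and execution character sets as UTF-8:

cl /utf-8 main.cpp
cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp

/utf-8 is shorthand for setting both charsets at once; /validate-charset can be added to either line to have the compiler flag byte sequences that are not valid UTF-8.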

Current versions of the C++ standard say the compiler must have both an execution character set, which determines the numeric value of character constants like 'a', and an execution wide-character set that determines the value of wide-character constants like L'é'.

To language-lawyer for a bit, there are very few requirements in the standard for how these must be encoded, and yet Visual C and C++ manage to break them. It must contain about 100 characters that cannot have negative values, and the encodings of the digits '0' through '9' must be consecutive. Neither capital nor lowercase letters have to be, because they weren't on some old mainframes. (That is, '0'+9 must be the same as '9', but there is still a compiler in real-world use today whose default behavior is that 'a'+9 is not 'j' but '«', and this is legal.) The wide-character execution set must include the basic execution set and have enough bits to hold all the characters of any supported locale. Every mainstream compiler supports at least one Unicode locale and understands valid Unicode characters specified with \Uxxxxxxxx, but a compiler that didn't could still claim to be complying with the standard.

The way Visual C and C++ violate the language standard is by making their wchar_t UTF-16, which can only represent some characters as surrogate pairs, when the standard says wchar_t must be a fixed-width encoding. This is because Microsoft defined wchar_t as 16 bits wide back in the 1990s, before the Unicode committee figured out that 16 bits were not going to be enough for the entire world, and Microsoft was not going to break the Windows API. It does support the standard char32_t type as well.
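
A small sketch of that consequence: with a 16-bit wchar_t, a character outside the Basic Multilingual Plane occupies two wchar_t code units:

#include <iostream>

int main()
{
    // U+1F600 is outside the BMP; with a 16-bit wchar_t it becomes a
    // surrogate pair, so the literal holds two code units plus the null.
    const wchar_t grin[] = L"\U0001F600";
    std::wcout << (sizeof(grin) / sizeof(wchar_t)) - 1
               << L" wchar_t code units" << std::endl;  // prints 2 on MSVC

    return 0;
}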

UTF-8 String Literals

The third issue this question raises is how to get the compiler to encode a string literal as UTF-8 in memory. You've been able to write something like this since C++11:

constexpr unsigned char hola_utf8[] = u8"¡Hola, mundo!";

This will encode the string as its null-terminated UTF-8 byte representation regardless of whether the source character set is UTF-8, UTF-16, Latin-1, CP1252, or even IBM EBCDIC 1047 (which is a silly theoretical example but still, for backward-compatibility, the default on IBM's Z-series mainframe compiler). That is, it's equivalent to initializing the array with { 0xC2, 0xA1, 'H', /* ... , */ '!', 0 }.
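
One quick way to verify those bytes (compiled as C++11/14/17, where a u8 literal is still char-based; C++20 changes its type to char8_t, and the source is assumed saved as UTF-8 so that '¡' survives):

#include <cstdio>

int main()
{
    const unsigned char hola_utf8[] = u8"¡Hola, mundo!";
    // Dump each byte in hex; the first two should be C2 A1 (U+00A1, '¡').
    for (const unsigned char* p = hola_utf8; *p; ++p)
        std::printf("%02X ", (unsigned)*p);
    std::printf("\n");

    return 0;
}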

If it would be too inconvenient to type a character in, or if you want to distinguish between superficially-identical characters such as space and non-breaking space, or precomposed and combining characters, you also have universal character escapes:

constexpr unsigned char hola_utf8[] = u8"\u00a1Hola, mundo!";

You can use these regardless of the source character set and regardless of whether you're storing the literal as UTF-8, UTF-16 or UCS-4. They were originally added in C99, but Microsoft supported them in Visual Studio 2015.

Edit: As reported by Matthew, u8" strings are buggy in some versions of MSVC, including 19.14. It turns out, so are literal non-ASCII characters, even if you specify /utf-8 or /source-charset:utf-8 /execution-charset:utf-8. The sample code above works properly in 19.22.27905.

There is another way to do this that worked in Visual C or C++ 2008, however: octal and hexadecimal escape codes. You would have encoded UTF-8 literals in that version of the compiler with:

const unsigned char hola_utf8[] = "\xC2\xA1Hello, world!";

I tried this code:

#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    std::wstringstream wss;
    wss << L"àéêù";
    std::wstring s = wss.str();
    const wchar_t* p = s.c_str();
    std::wcout << wss.str() << std::endl;

    std::wofstream file("C:\\a.txt");
    file << p << std::endl;

    return 0;
}

The debugger showed that wss, s and p all had the expected values (i.e. "àéêù"), as did the output file. However, what appeared in the console was óúÛ¨.

The problem is therefore in the Visual Studio console, not the C++. Using Bahbar's excellent answer, I added:

    SetConsoleOutputCP(1252);

as the first line (note that SetConsoleOutputCP requires #include <windows.h>), and the console output then appeared as it should.

//Save As Windows 1252
#include <iostream>
#include <windows.h>

int main()
{
    SetConsoleOutputCP(1252);
    std::cout << "àéêù" << std::endl;
}

Visual Studio does not support UTF-8 for C++, but partially supports it for C:

//Save As UTF8 without signature
#include <stdio.h>
#include <windows.h>

int main()
{
    SetConsoleOutputCP(65001);
    printf("àéêù\n");
}

Using _setmode() works¹ and is arguably better than changing the codepage or setting a locale, since it'll actually make your program output in Unicode and thus will be consistent - no matter which codepage or locale are currently set.

Example:

#include <iostream>
#include <io.h>
#include <fcntl.h>

int wmain()
{
    _setmode( _fileno(stdout), _O_U16TEXT );
    
    std::wcout << L"àéêù" << std::endl;

    return 0;
}

Inside Visual Studio, make sure you set up your project for Unicode (right-click Project -> General -> Character Set = Use Unicode Character Set).

MinGW users:

  1. Define both UNICODE and _UNICODE
  2. Add -finput-charset=iso-8859-1 to the compiler options to get around this error: "converting to execution character set: Invalid argument"
  3. Add -municode to the linker options to get around "undefined reference to `WinMain@16'" (read more).

Edit: The equivalent call to set Unicode input is: _setmode( _fileno(stdin), _O_U16TEXT );
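
A minimal sketch combining both calls, round-tripping accented input through the wide streams (the prompt text is arbitrary):

#include <iostream>
#include <string>
#include <io.h>
#include <fcntl.h>

int wmain()
{
    // Put both ends of the console in UTF-16 mode.
    _setmode(_fileno(stdin), _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wcout << L"Entrez votre nom : ";
    std::wstring name;
    std::getline(std::wcin, name);   // accented characters survive intact
    std::wcout << L"Bonjour, " << name << std::endl;

    return 0;
}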

Edit 2: An important piece of information, especially considering the question uses std::cout. This is not supported. The MSDN docs state (emphasis mine):

Unicode mode is for wide print functions (for example, wprintf) and is not supported for narrow print functions. Use of a narrow print function on a Unicode mode stream triggers an assert.

So, don't use std::cout when the console output mode is _O_U16TEXT; similarly, don't use std::cin when the console input is _O_U16TEXT. You must use the wide versions of these facilities (std::wcout, std::wcin).
And do note that mixing cout and wcout in the same output is not allowed (but I find it works if you call flush() and then _setmode() before switching between the narrow and wide operations).
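
A minimal sketch of that flush-then-switch pattern (it worked for this author, but the CRT documentation does not guarantee it):

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main()
{
    std::cout << "narrow text" << std::flush;   // flush before switching modes

    _setmode(_fileno(stdout), _O_U16TEXT);      // wide (UTF-16) mode
    std::wcout << L"àéêù" << std::flush;

    _setmode(_fileno(stdout), _O_TEXT);         // back to narrow mode
    std::cout << "narrow again" << std::endl;

    return 0;
}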

Make sure you do not forget to change the console's font to Lucida Console as mentioned by Bahbar: it was crucial in my case (French Win 7 64-bit with VC 2012).

Then, as mentioned by others, use SetConsoleOutputCP(1252) for C++. It may fail depending on the available codepages, so you might want to use GetConsoleOutputCP() to check that it worked, or at least check that SetConsoleOutputCP(1252) did not return zero (it returns zero on failure). Changing the global locale also works (for some reason there is no need to do cout.imbue(locale())), but it may break some libraries!
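
A short sketch of that check (SetConsoleOutputCP returns nonzero on success, zero on failure; the source is assumed saved as Windows-1252):

#include <windows.h>
#include <iostream>

int main()
{
    if (SetConsoleOutputCP(1252) == 0)   // zero means the call failed
    {
        std::cout << "Could not switch codepage, still using "
                  << GetConsoleOutputCP() << std::endl;
        return 1;
    }
    std::cout << "àéêù" << std::endl;

    return 0;
}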

In C, SetConsoleOutputCP(65001); or the locale-based approach worked for me once I had saved the source code as UTF-8 without signature (scroll down, the sans-signature choice is way down in the list of codepages).

Input using SetConsoleCP(65001); failed for me, apparently due to a bad implementation of codepage 65001 in Windows. The locale approach failed too, both in C and C++. A more involved solution, not relying on native chars but on wchar_t, seems required.

I had the same problem with Chinese input. My source code is UTF-8 and I added /utf-8 to the compiler options. It works fine with C++ wide strings and wide chars, but not with narrow strings/chars, which show up as garbled characters in the Visual Studio 2019 debugger and in my SQL database. I have to use the narrow characters because of converting to SQLAPI++'s SAString. Eventually, I found that checking the following option (Control Panel -> Region -> Administrative -> Change system locale) resolved the issue. I know it is not an ideal solution, but it did help me.

[screenshot of the Region settings dialog described above]
