Input encoding issue in Windows C++
I am developing a simple console application with Visual Studio 2013:
#include <cstdlib>
#include <iostream>
#include <string>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    std::wstring name;
    std::wcout << L"Enter your name: ";
    std::wcin >> name;
    std::wcout << L"Hello, " << name << std::endl;
    system("pause");
    return 0;
}
If I enter Ángel as input, the application works well and the output is:
Hello, Ángel
The problem is that if I put a breakpoint on
std::wcout << L"Hello, " << name << std::endl;
the Visual Studio debugger shows
+ name L"µngel" std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >
Although the output in the console is correct, in another part of the program I call the Win32 API function CopyFileW(), and it always fails because the path contains the substring Ángel, and the substring passed to the function has been transformed to µngel.
The problem is that Windows consoles are broken by default.
It arises because Windows uses a different 8-bit codepage in console applications than in GUI applications. In Western European Windows versions, the default 8-bit codepage (called ANSI) is Windows-1252, while the console 8-bit codepage (called OEM) is CP850.
Since your program doesn't know whether it is reading from the console or from a redirected file, it simply assumes ANSI input. But when you type Á, the console actually delivers the CP850 byte 0xB5. That byte is then interpreted using Windows-1252 as µ, that is, Unicode character U+00B5. The funny thing is that when you print it to the console, the inverse transformation happens, and you see an Á again. Two wrongs make a right! But when you use those characters in a non-console context, such as a path passed to CopyFileW(), the value really is µ.
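The byte-level mix-up can be illustrated with a small portable sketch. The helper names decode_cp1252 and decode_cp850 are hypothetical and cover only the handful of bytes discussed here; a real program would use a complete conversion table or the platform's conversion functions instead.

```cpp
// Decode one byte as Windows-1252 (ANSI). For ASCII and for bytes
// 0xA0..0xFF, Windows-1252 coincides with Latin-1, so the code point
// equals the byte value. Bytes 0x80..0x9F need a real table, which this
// sketch omits: it returns 0 for them.
char32_t decode_cp1252(unsigned char b) {
    if (b < 0x80 || b >= 0xA0) return b;
    return 0;
}

// Decode one byte as CP850 (OEM). Only the two high bytes relevant to
// this example are listed; every other non-ASCII byte returns 0.
char32_t decode_cp850(unsigned char b) {
    switch (b) {
        case 0xB5: return 0x00C1;  // Á
        case 0xE6: return 0x00B5;  // µ
        default:   return b < 0x80 ? b : 0;
    }
}
```

With these helpers, decode_cp850(0xB5) yields U+00C1 (Á, what was typed), while decode_cp1252(0xB5) yields U+00B5 (µ, what ends up in the std::wstring).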
You may think that you can convert from OEM to ANSI and then from ANSI to Unicode, and that would seem to work... until you run your program as:
c:\> myprogram < input.txt
And you wrote that input.txt using Notepad, so it is already in ANSI, and now you are applying a conversion you do not need.
You might then say that you could detect whether you are reading from the actual console or from a redirection, and do the OEM-to-ANSI conversion only when there is no redirect... until you do:
c:\> echo Ángel | myprogram
And you are doing it wrong again!
There are a lot of alternatives, but none of them works perfectly. At the very least you should use a Unicode font and a more conventional codepage, for example chcp 1252 to change the console's codepage to match the ANSI one. You can even make that the default with a bit of registry hacking:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP=1252
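The effect of chcp 1252 can also be requested from inside the program. The following is only a sketch under stated assumptions: match_console_to_ansi is a hypothetical helper name, the calls it relies on (GetACP, SetConsoleCP, SetConsoleOutputCP) are Win32 functions, and the change applies to the console the process is attached to, so the redirection and pipe caveats described above still apply.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: set the console input and output code pages to the
// process's ANSI code page (1252 on Western European systems).
// Returns true only if both calls succeed. On non-Windows builds it does
// nothing and returns false.
bool match_console_to_ansi() {
#ifdef _WIN32
    const UINT ansi = GetACP();
    return SetConsoleCP(ansi) != 0 && SetConsoleOutputCP(ansi) != 0;
#else
    return false;
#endif
}
```

Calling match_console_to_ansi() at the start of main, before the first read from std::wcin, is the intended use. It may fail when no console is attached, so the return value should be checked.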