
Input encoding issue in Windows C++

I am developing a simple console application with Visual Studio 2013:

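#include <cstdlib>   // system
#include <iostream>
#include <string>
#include <tchar.h>   // _tmain, _TCHAR

int _tmain(int argc, _TCHAR* argv[])
{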
    std::wstring name;
    std::wcout << L"Enter your name: ";
    std::wcin >> name;
    std::wcout << L"Hello, " << name << std::endl;
    system("pause");
    return 0;
}

If I enter Ángel as input, the application works well and the output is

Hello, Ángel

The problem is that if I put a breakpoint on

std::wcout << L"Hello, " << name << std::endl;

the Visual Studio debugger shows

+       name    L"µngel"    std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >

Although the console output is correct, another part of the program calls the Win32 API function CopyFileW(), and it always fails: the path contains the substring Ángel, but the string passed to the function contains µngel instead.

The problem is that Windows consoles are broken by default.

The problem arises because Windows uses a different 8-bit codepage in console applications than in windowed (GUI) applications. By default, in Western versions of Windows, the default 8-bit codepage (called ANSI) is Windows-1252, while the console 8-bit codepage (called OEM) is CP850.

Since your program doesn't know whether it is reading from the console or from a redirected file, it simply assumes ANSI input. But when you type Á, what the console actually delivers is its CP850 code: 0xB5. That byte is then interpreted as Windows-1252, where it means µ, that is, the Unicode character U+00B5. The funny thing is that when you print it to the console, the inverse transformation happens, and you see an Á again. Two wrongs make a right!

But when you use that character in a non-console context, such as a file path, it really is a µ.
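To make the double misinterpretation concrete, here is a minimal portable sketch. The function names are made up for illustration, and the CP850 table lists only three accented letters rather than the full upper half of the codepage. It assumes, as described above, that the byte was widened as if it were Windows-1252.

```cpp
#include <cassert>
#include <string>

// Decode one byte as Windows-1252. Only ASCII and the range 0xA0..0xFF are
// handled, where Windows-1252 matches the Unicode code point of the same value.
char32_t decode_windows1252(unsigned char byte) {
    assert(byte < 0x80 || byte >= 0xA0);  // 0x80..0x9F needs a table, omitted here
    return static_cast<char32_t>(byte);
}

// Decode one byte as CP850. Only a few accented letters are listed;
// a real decoder would carry the full 128-entry upper table.
char32_t decode_cp850(unsigned char byte) {
    if (byte < 0x80) return byte;  // the ASCII range is shared
    switch (byte) {
        case 0x82: return 0x00E9;  // é
        case 0xA0: return 0x00E1;  // á
        case 0xB5: return 0x00C1;  // Á
        default:   return 0xFFFD;  // not in this illustrative table
    }
}

// Re-decode text that was read as Windows-1252 but was really CP850.
// Code points of 0x100 and above cannot be a misread byte from the
// handled range, so they are passed through unchanged.
std::u32string redecode_as_cp850(const std::u32string& misread) {
    std::u32string fixed;
    for (char32_t c : misread) {
        fixed += (c < 0x100) ? decode_cp850(static_cast<unsigned char>(c)) : c;
    }
    return fixed;
}
```

The same byte 0xB5 decodes to U+00B5 (µ) under Windows-1252 and to U+00C1 (Á) under CP850, which is exactly the difference between what the debugger shows and what the console displays.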

You may think that you can convert from OEM to ANSI and then from ANSI to Unicode, and that would seem to work... until you run your program as:

c:\> myprogram < input.txt

And you wrote input.txt with Notepad, so it is already in ANSI, and now you are applying a conversion you do not need.

You might then say that you could detect whether you are reading from the actual console or from a redirection, and do the OEM-to-ANSI conversion only when there is no redirection... until you do:

c:\> echo Ángel | myprogram

And you are doing it wrong again!

There are a lot of alternatives, but none of them works perfectly. At the very least you should use a Unicode font and a more sensible codepage: something like chcp 1252 changes the console codepage to match the ANSI one. You can even make that the default with a bit of registry-fu:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP=1252
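If changing the registry is not an option, roughly the same effect as chcp can be requested from inside the program. The sketch below is only one possible approach: the helper name is hypothetical, and it assumes the Win32 calls GetACP, SetConsoleCP and SetConsoleOutputCP are available. On other platforms it compiles to a no-op.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switch this console's input and output codepages
// to the system ANSI codepage (Windows-1252 on Western installations).
// Returns true only if both calls succeed; on other platforms it does nothing.
bool match_console_to_ansi_codepage() {
#ifdef _WIN32
    const UINT ansi = GetACP();            // e.g. 1252
    return SetConsoleCP(ansi) != 0         // codepage used for console input
        && SetConsoleOutputCP(ansi) != 0;  // codepage used for console output
#else
    return false;
#endif
}
```

Call it once at the start of main, before reading from std::wcin. Note that the setting belongs to the console rather than to the process, so it may stay in effect after the program exits, and it still does nothing for the redirected-file and pipe cases described above.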
