Write a string with non-ASCII characters in it - error only if string is a variable?

I'm trying to write strings containing non-ASCII characters, such as "maçã", "pé", and so on, to a file.

I'm currently doing something like this:

#include <fcntl.h>    // _O_U16TEXT
#include <io.h>       // _setmode
#include <stdio.h>    // _fileno, stdout
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    // I added the line below to the question only recently,
    // but it was in the code all along; I forgot to write it.
    _setmode(_fileno(stdout), _O_U16TEXT);

    wstring word = L"";
    wstring file = L"example_file.txt";
    vector<wstring> my_vector;

    wofstream my_output(file);

    // Read lines until a lone "." is entered.
    while (word != L".")
    {
        getline(wcin, word);
        if (word != L".")
            my_vector.push_back(word);
    }

    for (vector<wstring>::iterator j = my_vector.begin(); j != my_vector.end(); ++j)
    {
        // element pointed to by the iterator going through the whole vector
        my_output << *j << endl;

        // string written directly in the code
        my_output << L"maçã pé" << endl;
    }
    my_output.close();
    return 0;
}

Now, if I enter "maçã", "pé" and "." as words (only the first two are stored in the vector), the output to the file is rather strange:

  • the words I entered (stored in variables) appear garbled: "ma‡Æ" and "p,";
  • the words stored directly in the code appear perfectly normal: "maçã pé".

I have tried using wcin >> word instead of getline(wcin, word), and writing to the console instead of a file. The results are the same: strings held in variables are written wrong, while strings written directly in the code come out perfectly.

I cannot find a reason for this to happen, so any help will be greatly appreciated.

Edit: I am working on Windows 7, using Visual C++ 2010.

Edit 2: added one more line of code that I had missed (right at the beginning).

Edit 3: following SigTerm's suggestion, I realised the problem is with the input: neither wcin nor getline gets the strings into the wstring variable word in the right format. So, the question is, do you know what is causing this or how to fix it?

Try to include

#include <locale>

and, at the beginning of main, write

std::locale::global(std::locale(""));

Windows makes encodings confusing because the console typically uses an "OEM" code page, while GUI applications use an "ANSI" code page. Each varies with the localized version of Windows in use. On US Windows, the OEM code page is 437 and the ANSI code page is 1252.

Keeping the above in mind, setting the streams to the locale being used fixes the problem. If working in the console, use the console's code page:

wcin.imbue(std::locale("English_United States.437"));
wcout.imbue(std::locale("English_United States.437"));

But keep in mind that most code pages are single-byte encodings, so each one can represent at most 256 characters:

wstring word;
wcin.imbue(std::locale("English_United States.437"));
wcout.imbue(std::locale("English_United States.437"));
getline(wcin, word);
wcout << word << endl;
wcout << L"maçã pé" << endl;

This prints on the console:

maça pé
maça pé

Code page 437 doesn't contain ã.
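To make the "no slot for ã" point concrete, here is a minimal, portable sketch of a Unicode-to-CP437 lookup. The function name encode_cp437 and the -1 return convention are illustrative assumptions, and the table is deliberately partial: it lists only a few accented letters, so -1 means "not in this small table", which for ã also happens to be true of the real code page.

```cpp
#include <map>

// Partial Unicode -> code page 437 mapping (illustrative subset only).
// Returns the CP437 byte value, or -1 if the character is not in this table.
int encode_cp437(char32_t cp)
{
    if (cp < 0x80)
        return static_cast<int>(cp);   // ASCII maps to itself

    static const std::map<char32_t, int> table = {
        {0x00E7, 0x87},  // ç
        {0x00E9, 0x82},  // é
        {0x00E2, 0x83},  // â
        {0x00E0, 0x85},  // à
        {0x00E1, 0xA0},  // á
    };

    std::map<char32_t, int>::const_iterator it = table.find(cp);
    return it == table.end() ? -1 : it->second;
}
```

With this table, encode_cp437(0x00E7) yields 0x87 for ç, while encode_cp437(0x00E3) yields -1 for ã, which is why the console output above shows a plain a in its place.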

You can use code page 1252 from the console if you:

  • Issue chcp 1252.
  • Use a TrueType console font like Consolas or Lucida Console.
  • Imbue the streams with English_United States.1252 instead.

Writing to a file has similar issues. If you view the file in Notepad, it uses the ANSI code page to interpret the bytes in the file. So even if a console app is using code page 437, Notepad will display the file incorrectly if it was written using code page 437. Writing the file in code page 1252 doesn't help either, because the two code pages don't cover the same set of Unicode code points. Some answers to this problem are to use a different file viewer, such as Notepad++, or to write the file in UTF-8, which supports all Unicode characters.
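As a rough illustration of the UTF-8 option, the sketch below encodes Unicode code points as UTF-8 bytes by hand. The helper name to_utf8 is hypothetical, it takes a std::u32string to avoid depending on the platform's wchar_t size, and it does no validation (for example of surrogate values), so treat it as a sketch rather than a complete encoder.

```cpp
#include <string>

// Encode a sequence of Unicode code points as UTF-8 (no input validation).
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: covers ç, ã, é
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting bytes can be written with a std::ofstream opened with std::ios::binary, so no locale conversion touches them. On Windows the input would first have to be converted from the UTF-16 held in a wstring, which this sketch does not cover.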

You are having the opposite of the problem described here.

The core reason is the same: characters in the "ASCII"[1] range 128-255 are less standardized than the characters in the range 32-127. Most Windows applications, whether they use "Unicode" or "ANSI" strings, use the same mapping between codes and characters as specified by Unicode. However, for mostly historical reasons, the console uses a separate code-to-character map, usually called the "code page". The exact table used depends on the language and configuration of Windows. For US English computers, that's the OEM 437 code page.

When you type ç in the console, you are really entering character code 135, because that's the code assigned to that character in the 437 code page used by the console. The rest of Windows interprets that same code according to its own tables, as a different character, which is why ‡ shows up in your output.
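The mismatch can be shown with a small, portable sketch that decodes the same byte under both code pages. The function names decode_cp437 and decode_cp1252 are made up for illustration, and only two byte values are covered; everything else above ASCII is reported as U+FFFD here simply because this sketch does not list it.

```cpp
// What a single byte means under the OEM 437 code page (two bytes covered).
char32_t decode_cp437(unsigned char b)
{
    switch (b) {
        case 0x87: return 0x00E7;   // byte 135 -> ç
        case 0x82: return 0x00E9;   // byte 130 -> é
        default:   return b < 0x80 ? b : 0xFFFD;  // ASCII as-is, rest not covered
    }
}

// What the same byte means under the ANSI 1252 code page (two bytes covered).
char32_t decode_cp1252(unsigned char b)
{
    switch (b) {
        case 0x87: return 0x2021;   // byte 135 -> ‡ (double dagger)
        case 0x82: return 0x201A;   // byte 130 -> ‚ (low quotation mark)
        default:   return b < 0x80 ? b : 0xFFFD;  // ASCII as-is, rest not covered
    }
}
```

Byte 135 decodes to ç in one table and ‡ in the other, and byte 130 decodes to é versus a comma-like ‚, which matches the "ma‡" and "p," seen in the question's output.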

You can use OemToChar (see its documentation) to convert text entered via the console to the corresponding string in Unicode encoding.

See my answer here for other background information.


[1] Yes, this range is technically not ASCII, but close enough. I'm also using the usual informal (and technically wrong) definition of Unicode throughout.
