
How to read a UTF-8 encoded file containing Chinese characters and output them correctly on the console?

I am writing a web crawler to fetch some Chinese web pages. The fetched files are encoded in UTF-8, and I need to read them in order to parse them, for example to extract URLs and Chinese characters. However, when I read a file into a std::string variable and output it to the console, the Chinese characters turn into garbage. I also applied boost::regex to the std::string variable and could extract all the URLs, but not the Chinese characters.

How can I solve these problems?

P.S. My .cpp files are encoded as ANSI by default, and the operating system is the Chinese-language edition of Windows 8.

This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it works with Chinese characters. See the documentation for _setmode and codecvt_utf8 for more information.

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <fcntl.h>
#include <io.h>

using namespace std;    // Sorry for this!

void read_all_lines(const wchar_t *filename)
{
    wifstream wifs;
    wstring txtline;
    int c = 0;

    wifs.open(filename);
    if(!wifs.is_open())
    {
        wcerr << L"Unable to open file" << endl;
        return;
    }
    // We are going to read an UTF-8 file
    wifs.imbue(locale(wifs.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>()));
    while(getline(wifs, txtline))
        wcout << ++c << L'\t' << txtline << L'\n';
    wcout << endl;
}

int wmain(int argc, wchar_t* argv[])   // wide entry point; avoids needing <tchar.h> for _tmain/_TCHAR
{
    // Console output will be UTF-16 characters
    _setmode(_fileno(stdout), _O_U16TEXT);
    if(argc < 2)
    {
        wcerr << L"Filename expected!" << endl;
        return 1;
    }
    read_all_lines(argv[1]);
    return 0;
}

If the Chinese characters don't look as expected, make sure the console is using a font that contains the required Unicode glyphs (i.e. a TrueType font rather than a raster/bitmap font).
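If the text is already held in a std::string, as in the question, the same codecvt_utf8 facet can be applied in memory through std::wstring_convert. The following is only a sketch: the helper names utf8_to_wstring and keep_cjk are made up for illustration, the "Chinese character" test is assumed to mean the CJK Unified Ideographs block U+4E00..U+9FFF, and std::wstring_convert is deprecated since C++17, so newer compilers may warn about it.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into wide characters.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wstring(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: keep only characters in the CJK Unified Ideographs
// block (U+4E00..U+9FFF). This range is an assumption about what counts as
// a "Chinese character"; punctuation and extension blocks are not covered.
std::wstring keep_cjk(const std::wstring& text)
{
    std::wstring out;
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] >= 0x4E00 && text[i] <= 0x9FFF)
            out += text[i];
    }
    return out;
}
```

Once the text is a std::wstring, it can be written with wcout after the _setmode call shown in the code above.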

In general, use the w variants (wstring, wfstream, wcout), set your locales to match the requirements, and put an L in front of string literals. locale::global(locale("")) sets the global locale to match the environment default; then, for each stream that should not follow that default, imbue a specific locale, e.g. wcout.imbue(locale("Chinese_China.936")), which might be Microsoft's name for your terminal's locale settings. This has always been enough for what I needed; I hope it works as well for you.

#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
  locale::global(locale(""));
  wstring word;
  while (wcin >>word)
    wcout<<word<<'\n';
  wcout<<L"好運\n";
}

If you need to display the characters correctly, you can use GNU libiconv. If you only need to process URLs, std::string works fine. The problem is the Windows console's code page, not the string itself. How locales behave depends on the OS and on the C++ standard library implementation, so I don't recommend relying on them.
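To illustrate the point about URLs: every byte of a multi-byte UTF-8 sequence has its high bit set, so a pattern made only of ASCII characters never matches in the middle of a Chinese character. The sketch below uses std::regex (C++11) rather than the boost::regex mentioned in the question; the function name extract_urls and the simplified URL pattern are assumptions for illustration, not part of the original answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect http(s) URLs from a UTF-8 encoded std::string.
// The pattern matches ASCII bytes only, so the bytes of Chinese text around
// a URL are never included in a match.
std::vector<std::string> extract_urls(const std::string& utf8_text)
{
    // Assumed, simplified URL pattern; real-world input may need a stricter one.
    static const std::regex url_re(R"(https?://[A-Za-z0-9._~:/?#@!$&*+,;=%-]+)");

    std::vector<std::string> urls;
    for (std::sregex_iterator it(utf8_text.begin(), utf8_text.end(), url_re), end;
         it != end; ++it)
    {
        urls.push_back(it->str());
    }
    return urls;
}
```

The returned strings are plain ASCII, so they print correctly on the console whatever its code page is.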

The Windows API function MultiByteToWideChar may help, but you need to check Microsoft's documentation on how these functions perform string conversions.
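A minimal sketch of the MultiByteToWideChar approach, assuming the input is UTF-8 (CP_UTF8) and using the usual two-call pattern: the first call returns the required length, the second performs the conversion. The helper name utf8_to_wide is made up for illustration, and the code is Windows-only, so it is wrapped in an _WIN32 guard.

```cpp
#ifdef _WIN32   // MultiByteToWideChar is a Win32 API, so this sketch is Windows-only
#include <windows.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-8 bytes in a std::string to a UTF-16 std::wstring.
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call: no output buffer, returns the number of wchar_t units needed.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), static_cast<int>(utf8.size()),
                                           nullptr, 0);
    if (needed <= 0)
        throw std::runtime_error("MultiByteToWideChar failed (possibly invalid UTF-8)");

    // Second call: convert into a buffer of exactly that size.
    std::wstring wide(static_cast<std::wstring::size_type>(needed), L'\0');
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), static_cast<int>(utf8.size()),
                                            &wide[0], needed);
    if (written != needed)
        throw std::runtime_error("MultiByteToWideChar failed during conversion");
    return wide;
}
#endif
```

The resulting std::wstring can then be passed to wide-character output such as wcout or WriteConsoleW instead of printing the raw UTF-8 bytes under the console's ANSI code page.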

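Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.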

 