Reading and printing UTF-8 symbols from a file using C++

Question

I've run into an issue and couldn't find an answer online. I found many similar questions, but none of the answers worked for me. I'm using Visual Studio 2015 on Windows 10.

Part of my code is:

wstring books[50];
wstring authors[50];
wstring genres[50];
wstring takenBy[50];
wstring additional;
bool taken[50];
_setmode(_fileno(stdout), _O_U8TEXT);
wifstream fd("bookList.txt");
i = 0;
while (!fd.eof())
{
    getline(fd, books[i]);
    getline(fd, authors[i]);
    getline(fd, genres[i]);
    getline(fd, takenBy[i]);
    fd >> taken[i];
    getline(fd, additional);
    i++;
}

What I need is to read a UTF-8 encoded text file with C++. But when I read the file, the wide strings get altered, and when I print them the output is completely different.

Input:

ąčę

Output:

ÄÄÄ

How do I avoid this and read the text correctly?

Answer 1

UTF-8 text (probably) does not belong in wide strings. Read about UTF-8 Everywhere. UTF-8 uses 8-bit bytes (sometimes several of them) to encode each Unicode character. So in C++ a Unicode character is parsed from a sequence of 1 to 4 bytes (i.e. chars); the original design allowed up to 6 bytes, but RFC 3629 restricts sequences to 4.
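To make the "several bytes per character" point concrete, here is a minimal sketch of how the first byte of a sequence tells you its length. The function name utf8_sequence_length is made up for illustration, and the check is deliberately loose: it does not reject the lead bytes 0xC0, 0xC1 or 0xF5 to 0xF7, which a strict decoder would refuse.

```cpp
#include <cstddef>

// Hypothetical helper: returns how many bytes the UTF-8 sequence starting
// with `lead` occupies, or 0 if `lead` cannot start a sequence.
// Loose check only: overlong lead bytes (0xC0, 0xC1) and out-of-range
// lead bytes (0xF5..0xF7) are not rejected here.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx (continuation) or 11111xxx
}
```

For the input in the question, ą is stored as the two bytes 0xC4 0x85, so the lead byte 0xC4 yields 2. Reading those bytes as if each were one character is what produces the Ä in the output.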

You need a UTF-8 parser, and the C11 and C++11 standards offer little here (the C++11 <codecvt> facets exist, but they were deprecated in C++17), so in practice you need an external library. Look into libunistring (a simple C library for UTF-8 parsing) or something else (Qt, POCO, Glib, ICU, ...). You could parse and convert UTF-8 into wide UTF-32 (using std::u32string and char32_t) and back, or, better, work internally in UTF-8 (using std::string and char).
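If you do go the UTF-32 route, this is roughly what such a library does for you. The sketch below is a hand-rolled decoder, not the API of libunistring or ICU; the name utf8_to_utf32 is hypothetical, and it skips some validation (overlong encodings, surrogates, code points above U+10FFFF), so treat it as an illustration rather than something to ship.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on an invalid lead byte, a truncated sequence,
// or a bad continuation byte. Overlong forms and surrogates are NOT rejected.
std::u32string utf8_to_utf32(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, the six bytes of ąčę become three char32_t values, which is convenient when you need to index or count characters. If you only read, store and print the text, keeping it in std::string avoids the conversion altogether.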

Your program would then parse and print sequences of chars (in UTF-8 encoding), using plain std::string and plain char (not std::wstring or wchar_t) while still processing UTF-8 sequences.
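Applied to the loop from the question, the narrow-string approach might look like the sketch below. The Book struct and read_books are hypothetical names; the record layout (four text lines, then a line holding the taken flag followed by extra text) is inferred from the question's code. It also stops on a failed read instead of testing eof(), and strips a UTF-8 byte order mark if the file starts with one, which some Windows editors add.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; field names mirror the arrays in the question.
struct Book {
    std::string title, authors, genres, takenBy, additional;
    bool taken = false;
};

// Reads five-line records from a UTF-8 stream. The bytes are kept untouched
// in std::string, so nothing is converted and nothing gets mangled.
std::vector<Book> read_books(std::istream& in)
{
    std::vector<Book> books;
    for (;;) {
        Book b;
        std::string flagLine;
        // Stop as soon as a full record can no longer be read.
        if (!(std::getline(in, b.title) && std::getline(in, b.authors) &&
              std::getline(in, b.genres) && std::getline(in, b.takenBy) &&
              std::getline(in, flagLine)))
            break;

        // Drop a UTF-8 byte order mark at the very start of the file.
        if (books.empty() && b.title.compare(0, 3, "\xEF\xBB\xBF") == 0)
            b.title.erase(0, 3);

        // Fifth line: the taken flag (0 or 1), then the rest of the line.
        std::istringstream fields(flagLine);
        fields >> b.taken;
        std::getline(fields, b.additional);
        books.push_back(b);
    }
    return books;
}
```

You would call it with std::ifstream fd("bookList.txt"); and print the fields with std::cout. On a Windows console you may additionally need to switch the output code page to UTF-8, for example with SetConsoleOutputCP(CP_UTF8) from <windows.h>. As far as I know, the _setmode(_fileno(stdout), _O_U8TEXT) call from the question is meant for wide-character output and should not be combined with narrow std::cout output.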

Answer 2

This is easy with Boost.Spirit:

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

using namespace boost::spirit;

int main()
{
    std::string in("ąčę");
    std::string out;
    qi::parse(in.begin(), in.end(), +unicode::char_, out);
    std::cout << out << std::endl;
}

The following example reads a sequence of tuples (book, authors, takenBy):

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_tuple.hpp>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

using namespace boost::spirit;

int main()
{
    std::string in("Book_1\nAuthors_1\nTakenBy_1\n"\
                   "Book ąčę\nAuthors_2\nTakenBy_2\n");
    std::vector<
        std::tuple<
            std::string, /* book */
            std::string, /* authors */
            std::string  /* takenBy */
        > 
    > out;
    auto ok = qi::parse(in.begin(), in.end(),
                        *(
                               +(unicode::char_ - qi::eol) >> qi::eol /* book */
                            >> +(unicode::char_ - qi::eol) >> qi::eol /* authors */
                            >> +(unicode::char_ - qi::eol) >> qi::eol /* takenBy */
                        ),
                        out);
    if(ok)
    {
        for(auto& entry : out)
        {
            std::string book, authors, takenBy;
            std::tie(book, authors, takenBy) = entry;
            std::cout << "book: "    << book    << std::endl
                      << "authors: " << authors << std::endl
                      << "takenBy: " << takenBy << std::endl;
        }
    }
}

This is only a demo using std::tuple and an unnamed parser, which is the third argument of qi::parse. You can use a struct instead of the tuple to represent books, authors, genres, etc. The unnamed parser may be replaced by a grammar, and you can read the content of the file into a string to be passed to qi::parse.
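For the last step, reading the file into a string, a minimal sketch using only the standard library could look like this. The name read_file is hypothetical; the file is opened in binary mode so the UTF-8 bytes reach the parser unchanged.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the whole file as raw bytes in a std::string.
// Throws std::runtime_error if the file cannot be opened.
std::string read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    // Copy every byte from the stream buffer until end of file.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The result can replace the hard-coded in string of the demo, e.g. std::string in = read_file("bookList.txt"); followed by the same qi::parse call.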


Answer 1
6 2017-07-02 16:43:18

Answer 2
2 2017-07-02 19:28:32
