Reading UTF-8 data with C++ on a Mac does not work

Question

Although my C++ experience is quite limited, I am trying to help a C++ programmer get his library working on a Mac. At the moment, the problem seems to be purely locale/encoding related.

Trying to create a minimal working example, I tested the following code, which reads a line of UTF-8 characters into a wide string (wstring), then goes through the string and prints each character.

It works perfectly on a Linux box, printing each character on its own line, but on a Mac I get one byte per line instead of one character per line.

The code is:

#include <sstream>
#include <iostream> 
#include <string>
#include <boost/locale.hpp>

using namespace std;

int main() {
    std::ios_base::sync_with_stdio(false);
    boost::locale::generator gen;
    locale mylocale = gen("pt_PT.UTF-8");
    locale::global(mylocale);

    wstring userInput;
    getline(wcin, userInput);

    wcerr << "Size of string is " << userInput.length() << endl;

    for (int i = 0; i < userInput.length(); ++i) {
        wcerr << userInput.at(i) << endl;
    }
    return 0;
}

and my testing string is a stupid Portuguese sentence:

O coração é um órgão frágil.

I am trying Boost.Locale because somebody told me it was the way to get Unicode working correctly on a Mac, but I would be happy to have a solution that uses only the C++ standard library.

EDIT:

The following code works on Mac. It doesn't compile on my Linux box because of the codecvt include, but I can manage that with some preprocessor (CPP) directives.

#include <sstream>
#include <iostream> 
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>

using namespace std;

int main() {
    // setting std::locale::global seems not to work (??)

    wcin.imbue(std::locale(locale(""), new std::codecvt_utf8<wchar_t>));
    wcerr.imbue(std::locale(locale(""), new std::codecvt_utf8<wchar_t>));

    wstring userInput;
    getline(wcin, userInput);

    wcerr << "Size of string is " << userInput.length() << endl;

    for (int i = 0; i < userInput.length(); ++i) {
        wcerr << userInput.at(i) << endl;
    }
    return 0;
}

Answer 1

This behavior is caused by the fact that in the UTF-8 encoding a character, also known as a code point, is represented by one or more code units.

Essentially, the loop:

for (int i = 0; i < userInput.length(); ++i)

iterates over code units, not code points. You can verify this by checking that userInput.length() is greater than the number of characters in your string.

By doing:

wcerr << userInput.at(i) << endl;

you are appending an endl after each code unit, which separates code units that belong to the same code point and produces invalid characters.

If you instead just output:

wcerr << userInput << endl;

you will get your string intact.

If you want to output each character separately, you have to group the code units that belong to the same code point and output each group as a whole.

UPDATE:

wcin doesn't do the conversion to code points by default. You need to explicitly state the encoding of the input and convert it. This is essentially what the following code does. The only major difference from your example is that I used the C++11 standard library instead of Boost.

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {

    std::locale::global( std::locale( std::locale(""), new std::codecvt_utf8<wchar_t> ) );

    std::wcin.imbue( std::locale() );
    std::wcout.imbue( std::locale() );
    std::wcerr.imbue( std::locale() );

    std::wstring user_input;
    std::wcin >> user_input;

    for( int i = 0; i < user_input.length(); ++i ) {
        std::wcout << user_input[i] << std::endl;
    }

    // Converting characters to uppercase
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t>>( std::locale() );

    for( int i = 0; i < user_input.length(); ++i ) {
        std::wcout << f.toupper(user_input[i]) << std::endl; // f.tolower() for lowercase
    }

    return 0;
}

PS: To compile this you will need to pass the C++11 standard flag:

g++ -std=c++11 main.cpp
