Interpreting UTF-8 Unicode strings in C++

Question

I'm currently coding in C++20 with G++ on WSL2 Ubuntu.

Suppose I have a .txt file containing UTF-8 encoded Unicode characters:

▄  ▄ ▄▄▄ ▄   ▄   ▄▄▄▄  ▄▄  ▄   ▄ ▄▄▄

How can I get the length (number of Unicode characters) of this Unicode string?
How can I read the file content and print out the Unicode string?

Answer 1

Assumptions:

stdout supports UTF-8 (on Windows you can get by with chcp 65001 at the cmd prompt).
We're counting Unicode code points, not glyphs made up of multiple code points. For example, "é" written as "e" followed by a combining acute accent counts as two code points.

In UTF-8, the first (start) byte of each encoded code point follows one of these bit patterns:

0xxxxxxx (single-byte encoding)
110xxxxx (two-byte encoding)
1110xxxx (three-byte encoding)
11110xxx (four-byte encoding)

Follow-on (continuation) bytes use the bit pattern 10xxxxxx.
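
As an illustration of the patterns above, here is a minimal sketch that maps a start byte to the length of its sequence. The name utf8_sequence_length is hypothetical and not part of the original answer. The function only matches the bit patterns, so it does not reject start bytes that are invalid for other reasons (such as overlong encodings).

```cpp
#include <cstdint>

// Hypothetical helper (an assumption for illustration): returns the total
// number of bytes in the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` is a continuation byte (10xxxxxx) or matches no start pattern.
constexpr int utf8_sequence_length(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx or 11111xxx
}

// Compile-time checks of the mapping.
static_assert(utf8_sequence_length(0x41) == 1); // 'A'
static_assert(utf8_sequence_length(0xE2) == 3); // first byte of U+2584
static_assert(utf8_sequence_length(0x96) == 0); // continuation byte
```

The "▄" character in the question is U+2584, encoded as the three bytes E2 96 84, which is why its start byte falls in the 1110xxxx group.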

A UTF-8 file can be read into a std::string and its bytes processed accordingly.
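
If the whole file is wanted rather than a single line, one possible approach is sketched below. The helper name read_file_bytes is hypothetical, and opening the stream in binary mode is a choice made here so that the bytes arrive unmodified.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (an assumption for illustration): reads an entire
// file into a std::string as raw bytes. std::string does not interpret
// the bytes, so UTF-8 content passes through untouched.
// Returns an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling read_file_bytes("input.txt") and writing the result to std::cout should print the text as-is on a terminal that expects UTF-8. Note that an unreadable file and an empty file are indistinguishable in this sketch.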

Demo code:

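#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main() {
    ifstream f("input.txt");
    if(!f) {
        cerr << "cannot open input.txt" << endl;
        return 1;
    }

    string s;
    getline(f,s); // reads only the first line, as raw UTF-8 bytes
    cout << "string: " << s << endl;
    cout << "length(bytes): " << s.length() << endl;

    // Each code point has exactly one non-continuation byte, so counting
    // the bytes that do not match 10xxxxxx counts the code points.
    int codepoints = 0;
    for(auto b : s) {
        if((b & 0xC0) != 0x80)
            ++codepoints;
    }

    cout << "length(code points): " << codepoints << endl;
}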

Output:

string: ▄  ▄ ▄▄▄ ▄   ▄   ▄▄▄▄  ▄▄  ▄   ▄ ▄▄▄
length(bytes): 72
length(code points): 36
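
The demo only counts code points. If their values are needed as well, the bytes can be decoded by hand. The following is a rough sketch, not a full validator: decode_utf8 is a hypothetical name, stray or truncated bytes are replaced with U+FFFD as a design choice made here, and overlong encodings and surrogate values are not rejected.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration): decodes UTF-8
// bytes into UTF-32 code points. Malformed input yields U+FFFD per bad byte.
std::u32string decode_utf8(std::string_view s) {
    constexpr char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0; // sequence length taken from the start byte
        char32_t cp = 0;     // payload bits collected so far
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

        // Stray continuation byte, invalid start byte, or truncated input.
        bool ok = (len != 0) && (i + len <= s.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // expected 10xxxxxx
            else cp = (cp << 6) | (c & 0x3F);     // append 6 payload bits
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(replacement); ++i; }
    }
    return out;
}
```

For well-formed input, the size of the returned string equals the code point count computed by the demo, and each "▄" decodes to the single value 0x2584.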

Solution 1
0 2021-10-27 16:30:01
