Interpreting UTF-8 Unicode strings in C++
Currently coding in C++20 using WSL2 Ubuntu, G++.
If I had a .txt file consisting of UTF-8 Unicode characters:
▄ ▄ ▄▄▄ ▄ ▄ ▄▄▄▄ ▄▄ ▄ ▄ ▄▄▄
How can I get the length (number of Unicode characters) of this Unicode string?
How can I read the file content and print out the Unicode string?
Assumptions:
- stdout supports UTF-8 (on Windows you can get by with chcp 65001 at the cmd prompt).
- UTF-8 encoding consists of start bytes following the bit patterns:
    0xxxxxxx (single-byte encoding)
    110xxxxx (two-byte encoding)
    1110xxxx (three-byte encoding)
    11110xxx (four-byte encoding)
  Follow-on bytes use 10xxxxxx as a bit pattern.
- UTF-8 can be read using std::string and the bytes processed accordingly.
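To make the bit patterns above concrete, here is a minimal sketch that classifies a start byte and decodes a std::string into code points. The names utf8_sequence_length and decode_utf8 are hypothetical helpers invented for this illustration, not standard library functions. The sketch only checks the bit patterns listed above; it does not reject overlong encodings or surrogate values, so treat it as an illustration rather than a validating decoder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: total byte length of the sequence starting with `lead`,
// or 0 if `lead` is not a start byte (10xxxxxx follow-on byte or 11111xxx).
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: decodes UTF-8 bytes into code points.
// A byte that cannot start or continue a sequence yields U+FFFD.
std::u32string decode_utf8(const std::string& s) {
    // Mask selecting the payload bits of a start byte, indexed by sequence length.
    static constexpr unsigned char payload_mask[] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const std::size_t len = utf8_sequence_length(lead);
        bool ok = len != 0 && i + len <= s.size();
        char32_t cp = ok ? static_cast<char32_t>(lead & payload_mask[len]) : 0;
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            ok = (b & 0xC0) == 0x80;      // follow-on bytes must be 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);  // append the 6 payload bits
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(U'\uFFFD'); ++i; } // resynchronise one byte later
    }
    return out;
}
```

For the character ▄ (U+2584, bytes E2 96 84), utf8_sequence_length(0xE2) gives 3 and decode_utf8 yields a single code point. The size() of the returned std::u32string is then the code point count.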
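Demo code:
    #include <iostream>
    #include <fstream>
    #include <string>
    using namespace std;

    int main() {
        ifstream f("input.txt");
        if (!f) {
            cerr << "cannot open input.txt" << endl;
            return 1;
        }
        string s;
        getline(f, s); // reads the first line only, as raw UTF-8 bytes
        cout << "string: " << s << endl;
        cout << "length(bytes): " << s.length() << endl;
        // Every code point has exactly one start byte, so count the bytes
        // that are not follow-on bytes (10xxxxxx).
        int codepoints = 0;
        for (auto b : s) {
            if ((b & 0xC0) != 0x80) // not a UTF-8 follow-on byte?
                ++codepoints;
        }
        cout << "length(code points): " << codepoints << endl;
    }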
Output:
string: ▄ ▄ ▄▄▄ ▄ ▄ ▄▄▄▄ ▄▄ ▄ ▄ ▄▄▄
length(bytes): 72
length(code points): 36
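The demo reads only the first line with getline. If the file spans several lines, one possible variation is to read the whole file as bytes and apply the same counting rule. The helper names read_file_bytes and count_code_points below are hypothetical, chosen for this sketch, and it assumes the file fits comfortably in memory.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads the entire file at `path` as raw bytes.
// Returns an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(f),
                       std::istreambuf_iterator<char>());
}

// Counts code points by skipping follow-on bytes (10xxxxxx),
// the same rule the demo loop uses.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Calling count_code_points(read_file_bytes("input.txt")) would then give the count for the whole file. Note that newline characters, and a byte order mark if the file starts with one, are counted as code points too, so the result can be larger than the first-line count.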