Reading Unicode characters from a file in C++

Question

I want to read a Unicode (UTF-8) file character by character, but I don't know how to read characters from a file one at a time.

Can anyone tell me how to do that?

Answer 1

UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:

#include <fstream>
#include <iostream>
#include <iterator>  // std::istreambuf_iterator
#include <string>

std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());

Each char in the resulting string is one UTF-8 byte, which is not necessarily a whole character. You could loop through it like so:

for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
    char nextChar = *i;
    // do stuff here.
}
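Because the loop above visits bytes rather than characters, it can help to see how bytes map to characters. The following is a minimal sketch, assuming the input is valid UTF-8: every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point, so counting those bytes counts the characters. The function name count_code_points is made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8; no
// validation is performed. count_code_points is a hypothetical helper name.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte: starts a character
            ++count;
        }
    }
    return count;
}
```

For the string content read above, count_code_points(content) would give the number of characters, while content.size() gives the number of bytes.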

Alternatively, you could open the file in binary mode and then move through each byte that way:

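std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
    char nextChar;
    // get() reads one raw byte and does not skip whitespace (operator>> would);
    // testing its result stops the loop before a failed read is processed.
    while (fs.get(nextChar)) {
        // do stuff here.
    }
}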

If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, it is less painful than ICU for largely practical things.

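QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    // setCodec is the Qt 4/5 API; Qt 6 replaced it with setEncoding
    // and already defaults to UTF-8.
    in.setCodec("UTF-8");
    QString contents = in.readAll();
    // use contents here
}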

Answer 2

First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description

Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, check it against that table:

(Row 1) If the most significant bit is 0 ( (byte & 0x80) == 0 ), you have your character. Note the inner parentheses: in C and C++, == binds more tightly than &.

(Row 2) If the three most significant bits are 110 ( (byte & 0xE0) == 0xC0 ), you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) form the first byte of the Unicode character (00000YYY), and its two least significant bits together with the 6 least significant bits of the next byte (10xxxxxx) form the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily with the shift and logical operators of C/C++:

UnicodeByte1 = (UTF8Byte1 >> 2) & 0x07;                         // 00000YYY
UnicodeByte2 = ((UTF8Byte1 << 6) & 0xC0) | (UTF8Byte2 & 0x3F);  // yyxxxxxx

And so on...

It sounds a bit complicated, but it's not difficult once you know how to move the bits into the proper place to decode a UTF-8 string.
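The "and so on" for the three-byte and four-byte rows follows the same pattern: mask off the prefix bits of the lead byte, then append 6 bits from each continuation byte. A minimal sketch that does this for all four rows of the table is below. The name decode_utf8 and its signature are assumptions made for illustration, and the sketch does not reject overlong encodings or surrogate code points, which a strict decoder would.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes the UTF-8 sequence starting at utf8[pos] and advances pos past it.
// Throws std::out_of_range if pos is past the end, and std::runtime_error on
// an invalid lead byte, a bad continuation byte, or a truncated sequence.
// decode_utf8 is a hypothetical helper name, not a standard function.
char32_t decode_utf8(const std::string& utf8, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(utf8.at(pos));
    char32_t code_point = 0;
    std::size_t extra = 0;  // number of continuation bytes that must follow

    if ((lead & 0x80) == 0x00)      { code_point = lead;        extra = 0; }  // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { code_point = lead & 0x1F; extra = 1; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { code_point = lead & 0x0F; extra = 2; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { code_point = lead & 0x07; extra = 3; }  // 11110xxx
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (pos + extra >= utf8.size()) {
        throw std::runtime_error("truncated UTF-8 sequence");
    }
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char next = static_cast<unsigned char>(utf8[pos + i]);
        if ((next & 0xC0) != 0x80) {
            throw std::runtime_error("invalid UTF-8 continuation byte");
        }
        code_point = (code_point << 6) | (next & 0x3F);  // append 6 payload bits
    }
    pos += extra + 1;
    return code_point;
}
```

Calling it in a loop with pos starting at 0 until pos reaches utf8.size() walks the string one code point at a time.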

Answer 3

In theory, stdlib.h has a function mblen which should return the length of a multibyte character. But in my case it returned -1 for the first byte of a multibyte character and kept returning -1 after that. So I wrote the following:

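// Returns the length in bytes of the UTF-8 sequence starting at i_ch,
// or -1 if i_ch is null or does not point at a valid lead byte.
// (The original post omitted the signature; this name is a placeholder.)
int utf8_char_length(const char* i_ch)
{
    if (i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    int mask = 0x80;
    while (ch & mask) {  // count the leading 1 bits
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;            // 0xxxxxxx: single-byte (ASCII)
    if (l == 1 || l > 4) return -1;  // continuation byte or invalid lead byte
    return l;
}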

That took less time than researching how mblen should be used.

Answer 4

Try this: get the file contents and then loop through the text based on its length.

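Sketch in C++ (needs <fstream>, <iterator> and <string>):

std::ifstream file("my_file.txt", std::ios::binary);
std::string s((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
for (std::size_t i = 0; i < s.size(); ++i)
{
    char the_character = s[i];  // one byte, not necessarily a whole UTF-8 character

    // TODO : Do your thing :o)
}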

4 answers

Answer 1
3 2012-01-07 02:41:07

Answer 2
3 2012-01-07 15:32:05

Answer 3
1 2014-10-22 06:31:36

Answer 4
-2 2012-01-07 02:29:53
