How to read a UTF-8 encoded text file using std::ifstream?

I'm having a hard time parsing an XML file.

The file was saved with UTF-8 encoding.

Plain ASCII characters are read correctly, but Korean characters are not.

So I made a simple program that reads a UTF-8 text file and prints its content.

Text file (test.txt)

ABC가나다

Test program

#include <fstream>
#include <iostream>
#include <string>
#include <iterator>
#include <streambuf>

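// Format one byte as two uppercase hex digits.
// Returns a pointer to a static buffer that the next call overwrites.
const char* hex(char c) {
    const char REF[] = "0123456789ABCDEF";
    static char output[3] = "XX";
    output[0] = REF[0x0f & (c >> 4)];  // high nibble
    output[1] = REF[0x0f & c];         // low nibble
    return output;
}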

int main() {
    std::cout << "File(ifstream) : ";
    std::ifstream file("test.txt");
    std::string buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    for (auto c : buffer) {
        std::cout << hex(c)<< " ";
    }
    std::cout << std::endl;
    std::cout << buffer << std::endl;

    //String literal
    std::string str = "ABC가나다";
    std::cout << "String literal : ";
    for (auto c : str) {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << str << std::endl;

    return 0;
}

Output

File(ifstream) : 41 42 43 EA B0 80 EB 82 98 EB 8B A4
ABC媛?섎떎
String literal : 41 42 43 B0 A1 B3 AA B4 D9
ABC가나다

The output shows that the characters are encoded differently in the string literal and in the file.

As far as I know, char strings in C++ are encoded in UTF-8, which is why we can see them through printf or cout. So the bytes were supposed to be the same, but they are actually different...

Is there any way to read a UTF-8 text file using std::ifstream?


I succeeded in parsing the XML file using std::wifstream, following this article.

But most of the libraries I'm using only support const char* strings, so I'm looking for another way that uses std::ifstream.

I've also read this article, which says not to use wchar_t: treating a char string as a multi-byte character string is sufficient.

Encoding "ABC가나다" using UTF-8 should give you

"\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4"

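So the content you read from the file is correct. The problem is with the encoding of your string literal. Non-ASCII characters in an ordinary string literal are encoded in whatever execution character set the compiler uses, and here that is evidently not UTF-8: the bytes B0 A1 B3 AA B4 D9 look like the EUC-KR (CP949) encoding of 가나다. Prefix the literal with u8 to get a UTF-8 literal (the compiler must also read the source file in its actual encoding for this to work):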

u8"ABC가나다"
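As a minimal sketch of the point above: std::ifstream does no character-set conversion, so reading the file hands you the UTF-8 bytes unchanged. The names read_all_bytes, kUtf8Sample and round_trip_demo are hypothetical, opening the file in binary mode is my assumption (it avoids newline translation on Windows), and the expected bytes are spelled with hex escapes so the comparison does not depend on the source file encoding.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// UTF-8 bytes of "ABC" followed by the three Hangul syllables from test.txt,
// written as escapes so the source file encoding cannot change them.
const std::string kUtf8Sample =
    "\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4";

// Hypothetical helper: read a whole file into a std::string, byte for byte.
// Returns an empty string if the file cannot be opened.
std::string read_all_bytes(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return std::string();
    }
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}

// Round trip: write the UTF-8 bytes to `path`, read them back, and check
// that std::ifstream returned exactly the same bytes.
void round_trip_demo(const std::string& path) {
    {
        std::ofstream out(path, std::ios::binary);
        out << kUtf8Sample;
    }
    assert(read_all_bytes(path) == kUtf8Sample);
}
```

The resulting std::string can be passed to any const char* API that expects UTF-8 via c_str(); no wide-character stream is needed for that.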

At this point I assume you are using Windows, otherwise you probably wouldn't have any issues with encodings. You will have to change your terminal's character set to UTF-8:

chcp 65001
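If you would rather not rely on the user running chcp, the same switch can be requested from inside the program. This is only a sketch, assuming a Windows console: SetConsoleOutputCP and CP_UTF8 come from windows.h, while the function names use_utf8_console and print_sample are hypothetical. On other platforms the helper does nothing.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Ask the Windows console to interpret this process's output as UTF-8
// (code page 65001). Returns true on success. On non-Windows platforms
// nothing is changed and true is returned.
bool use_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// Print the sample text as raw UTF-8 bytes after switching the console.
void print_sample() {
    const std::string sample =
        "\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4";
    // 3 ASCII bytes plus 3 Hangul syllables of 3 bytes each.
    assert(sample.size() == 12);
    if (!use_utf8_console()) {
        std::cerr << "warning: could not switch the console to UTF-8\n";
    }
    std::cout << sample << '\n';
}
```

Whether the characters actually show up also depends on the console font having Hangul glyphs, which this code cannot control.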

What is happening in your case is that you read UTF-8 text from the file into a string and then print it to a non-Unicode terminal, which cannot display it the way you expect. When you print your string literal, you are printing a non-Unicode byte sequence, but that sequence's encoding matches your terminal's encoding, so you see what you expected.

PS: I used https://mothereff.in/utf-8 to get the UTF-8 representation of your string in hex.
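The same byte sequence can also be derived in code rather than with an online tool. The following is a rough sketch of the standard UTF-8 encoding rules; append_utf8 and to_utf8 are hypothetical names, and the input is given as code points (가 is U+AC00, 나 is U+B098, 다 is U+B2E4).

```cpp
#include <cassert>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Precondition: cp is at most 0x10FFFF and is not a surrogate.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole sequence of code points as a UTF-8 byte string.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        append_utf8(out, cp);
    }
    return out;
}
```

Calling to_utf8(U"ABC\uAC00\uB098\uB2E4") yields the twelve bytes 41 42 43 EA B0 80 EB 82 98 EB 8B A4, the same bytes the test program read from the file.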
