
How to set file encoding format to UTF8 in C++

A requirement for my software is that the file containing exported data be encoded as UTF-8. But when I write the data to the file, the encoding is always ANSI. (I use Notepad++ to check this.)

What I'm currently doing is trying to convert the file manually: reading it, converting it to UTF-8, and writing the text to a new file.

line is a std::string
inputFile is a std::ifstream
pOutputFile is a FILE*

// ...

if( inputFile.is_open() )
{
    while( inputFile.good() )
    {
        getline(inputFile,line);

        //1
        DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
        wchar_t *pwcharText;
        pwcharText = new wchar_t[ dwCount];

        //2
        MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );

        //3
        dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
        char *pText;
        pText = new char[ dwCount ];

        //4
        WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );

        fprintf(pOutputFile,pText);
        fprintf(pOutputFile,"\n");

        delete[] pwcharText;
        delete[] pText;
    }
}

// ...

Unfortunately the encoding is still ANSI. I searched for a solution for a while, but I always end up at the approach using MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?

I also looked here on SO for a solution, but most UTF-8 questions deal with C# and PHP.

On Windows with VC++ 2010 it is possible (not yet implemented in GCC, as far as I know) using the localization facet std::codecvt_utf8_utf16 (i.e. in C++11). The sample code on cppreference.com has all the basic information you need to read and write a UTF-8 file.

std::wstring wFromFile = L"𤭢teststring";  // wide literal; _T() yields a narrow string in non-Unicode builds
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut<<wFromFile;

It sets the ANSI-encoded file to UTF-8 (checked in Notepad). Hope this is what you need.

On Windows, files don't have encodings. Each application assumes an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.

AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write UTF-8 to the file. Since you already converted the data yourself, use fwrite() instead so that the UTF-8 data is written as-is, e.g.:

DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), NULL, 0 );  
if (dwCount == 0) continue;

std::vector<WCHAR> utf16Text(dwCount);  
MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), &utf16Text[0], dwCount );  

dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), NULL, 0, NULL, NULL );  
if (dwCount == 0) continue;

std::vector<CHAR> utf8Text(dwCount);  
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), &utf8Text[0], dwCount, NULL, NULL );  

fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);  
fprintf(pOutputFile, "\n");  
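For readers who want to try the "convert, then write the bytes untouched" idea without the Win32 API, here is a portable sketch. It swaps in a different technique: it assumes the input is ISO-8859-1 (Latin-1), where every byte value equals its code point, rather than the real ANSI code page that CP_ACP refers to. The names latin1_to_utf8 and write_line_raw are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Assumption: the input is ISO-8859-1, not an arbitrary Windows code page.
// Bytes below 0x80 are copied; bytes 0x80..0xFF become two-byte UTF-8 sequences.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8.push_back(static_cast<char>(c));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation byte
        }
    }
    return utf8;
}

// Writes the already-encoded bytes and a newline with fwrite/fputc,
// so the data is never interpreted as a format string.
bool write_line_raw(std::FILE* file, const std::string& utf8_line)
{
    if (std::fwrite(utf8_line.data(), 1, utf8_line.size(), file) != utf8_line.size())
        return false;
    return std::fputc('\n', file) != EOF;
}
```

On Windows, the MultiByteToWideChar / WideCharToMultiByte pair shown above is still what handles the actual ANSI code page; this sketch only makes the byte-level effect visible.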

The type char knows nothing about encodings; all it can do is store 8 bits. Therefore any text file is just a sequence of bytes, and the user must guess the underlying encoding. A file starting with the UTF-8 BOM indicates UTF-8, but using a BOM is no longer recommended. The type wchar_t, in contrast, is always interpreted as UTF-16 on Windows.

So let's say you have a file encoded in UTF-8 with just one line: "Confucius says: Smile. 孔子说:微笑!" The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and a MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows junk, because it assumes my local code page 1252 for the char* string.

Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful: the CP_Whatever argument is optional, and if omitted the local code page is used.

#include <iostream>
#include <fstream>
#include <string>
#include <atlbase.h>

int main(int argc, char** argv)
{
  std::fstream  afile;
  std::string line1A = u8"Confucius says: Smile. 孔子说:微笑! 😊";
  std::wstring line1W;

  afile.open("Test.txt", std::ios::out | std::ios::app);
  if (!afile.is_open())
        return 0;

  afile << "\n" << line1A;
  afile.close();

  afile.open("Test.txt", std::ios::in);
  std::getline(afile, line1A);
  line1W = CA2W(line1A.c_str(), CP_UTF8);
  MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);
  MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);
  afile.close();

  return 0;
}
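The point that a text file is only a byte sequence whose encoding must be guessed can be made concrete with a small sketch. The two helpers below (is_ascii_only and is_valid_utf8 are names invented here) show the kind of check a tool might run, though actual editors use their own heuristics. Text that contains only ASCII has identical bytes in UTF-8 and in code page 1252, so nothing in the file can tell the two apart.

```cpp
#include <cstddef>
#include <string>

// True if every byte is below 0x80. Such data is byte-identical in
// ASCII, UTF-8 and the common single-byte "ANSI" code pages.
bool is_ascii_only(const std::string& bytes)
{
    for (unsigned char c : bytes)
        if (c >= 0x80) return false;
    return true;
}

// True if the bytes form well-formed UTF-8: correct continuation bytes,
// no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& bytes)
{
    std::size_t i = 0;
    const std::size_t n = bytes.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        unsigned long min = 0;
        if (c < 0x80) { ++i; continue; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;                     // stray continuation or invalid lead byte
        if (i + len > n) return false;         // sequence truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}
```

If is_ascii_only returns true for the exported data, a report of "ANSI" from an editor does not mean the conversion failed; the file is equally valid as UTF-8.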
