How to set the file encoding to UTF-8 in C++

Question

A requirement for my software is that the file containing the exported data shall be encoded in UTF-8. But when I write the data to the file, the encoding is always ANSI. (I use Notepad++ to check this.)

What I'm currently doing is trying to convert the file manually: reading it, converting it to UTF-8, and writing the text to a new file.

line is a std::string
inputFile is a std::ifstream
pOutputFile is a FILE*

// ...

if( inputFile.is_open() )
{
    while( inputFile.good() )
    {
        getline(inputFile,line);

        //1
        DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
        wchar_t *pwcharText;
        pwcharText = new wchar_t[ dwCount];

        //2
        MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );

        //3
        dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
        char *pText;
        pText = new char[ dwCount ];

        //4
        WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );

        fprintf(pOutputFile,pText);
        fprintf(pOutputFile,"\n");

        delete[] pwcharText;
        delete[] pText;
    }
}

// ...

Unfortunately the encoding is still ANSI. I searched for a solution for a while, but I always end up at the approach using MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?

I also looked here on SO for a solution, but most UTF-8 questions deal with C# and PHP.

Answer 1

On Windows with VC++ 2010 this is possible (not yet implemented in GCC, as far as I know) using the localization facet std::codecvt_utf8_utf16 (i.e. in C++11). The sample code from cppreference.com has all the basic information you need to read/write a UTF-8 file.

std::wstring wFromFile = L"𤭢teststring"; // wide literal; _T() yields a narrow string in non-Unicode builds
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut<<wFromFile;

This turns the previously ANSI-encoded file into UTF-8 (checked in Notepad). Hope this is what you need.
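To make it concrete what bytes a UTF-8 conversion should put into the file, here is a minimal hand-rolled sketch. It is not the facet's implementation, just an illustration; the helper name append_utf8 is made up for this example.

```cpp
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Returns false (appending nothing) for surrogates and values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp)
{
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // Surrogate code points are not valid on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

For the first character of the test string above, U+24B62, this appends the four bytes F0 A4 AD A2, which is what a hex editor should show at the start of textOut.txt if the conversion worked.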

Answer 2

On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.
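A rough sketch of the byte-order-mark idea in portable C++, assuming the text you pass in is already UTF-8 encoded. The function name write_utf8_with_bom is invented for illustration.

```cpp
#include <fstream>
#include <string>

// Write `utf8_text` to `path`, preceded by the UTF-8 byte-order mark (EF BB BF).
// Assumption: the caller passes text that is already UTF-8 encoded.
bool write_utf8_with_bom(const std::string& path, const std::string& utf8_text)
{
    // Binary mode so the bytes reach the file exactly as given.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    out.write("\xEF\xBB\xBF", 3);
    out.write(utf8_text.data(), static_cast<std::streamsize>(utf8_text.size()));
    return out.good();
}
```

Whether a given editor honors the mark is still up to that editor, as said above.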

Answer 3

AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. Since you already converted the data yourself, use fwrite() instead so you are writing the UTF-8 data as-is, e.g.:

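// Pass 1: ANSI (CP_ACP) -> UTF-16. With an explicit length, no terminating null is counted.
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), static_cast<int>(line.length()), NULL, 0 );
if (dwCount == 0) continue; // conversion failed or the line is empty (an empty line is skipped entirely)

std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), static_cast<int>(line.length()), &utf16Text[0], dwCount );

// Pass 2: UTF-16 -> UTF-8.
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], static_cast<int>(utf16Text.size()), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;

std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], static_cast<int>(utf16Text.size()), &utf8Text[0], dwCount, NULL, NULL );

// Write the UTF-8 bytes unchanged.
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");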
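Separately from any conversion fprintf() may do, passing the data itself as the format string (as in fprintf(pOutputFile, pText)) means any '%' in the text is read as a format specifier. A small portable sketch of writing the bytes untouched; write_utf8_line is a hypothetical helper name, and the input is assumed to be UTF-8 already.

```cpp
#include <cstdio>
#include <string>

// Write one line of already-encoded UTF-8 bytes followed by '\n'.
// fwrite copies the bytes unchanged, so '%' in the data is not interpreted.
bool write_utf8_line(std::FILE* file, const std::string& utf8_line)
{
    if (!utf8_line.empty() &&
        std::fwrite(utf8_line.data(), 1, utf8_line.size(), file) != utf8_line.size()) {
        return false;
    }
    return std::fputc('\n', file) != EOF;
}
```

Opening the file with "wb" rather than "w" keeps the C runtime on Windows from translating '\n' into "\r\n"; whether you want that depends on the consumer of the file.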

Answer 4

The type char has no notion of encoding; all it can do is store 8 bits. Therefore any text file is just a sequence of bytes, and the user must guess the underlying encoding. A file starting with a BOM indicates UTF-8, but using a BOM is no longer recommended. The type wchar_t, in contrast, is always interpreted as UTF-16 on Windows.

So let's say you have a file encoded in UTF-8 with just one line: "Confucius says: Smile. 孔子说：微笑！" The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and a MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows some junk, because it assumes my local codepage 1252 for the char* string.

Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful: the CP_Whatever argument is optional, and if omitted the local codepage is used.

#include <fstream>
#include <string>
#include <atlbase.h>

int main(int argc, char** argv)
{
  std::fstream  afile;
  // Before C++20 a u8 literal is an array of char, so it fits in a std::string.
  std::string line1A = u8"Confucius says: Smile. 孔子说：微笑！ 😊";
  std::wstring line1W;

  // Append the UTF-8 bytes to the existing file.
  afile.open("Test.txt", std::ios::out | std::ios::app);
  if (!afile.is_open())
    return 0;

  afile << "\n" << line1A;
  afile.close();

  // Read the first line back as raw bytes.
  afile.open("Test.txt", std::ios::in);
  std::getline(afile, line1A);
  // Convert UTF-8 -> UTF-16; without CP_UTF8 the local codepage would be used.
  line1W = CA2W(line1A.c_str(), CP_UTF8);
  MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0); // correct text
  MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);  // junk: bytes read as the local codepage
  afile.close();

  return 0;
}
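The "must guess the underlying encoding" part can be sketched roughly like this: check for a BOM, and otherwise test whether the bytes are at least structurally valid UTF-8. Both function names are made up for this example, and the check is only a heuristic (it does not reject every overlong form or encoded surrogate, and some non-UTF-8 text can pass by accident).

```cpp
#include <cstddef>
#include <string>

// True if `bytes` starts with the UTF-8 byte-order mark EF BB BF.
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Heuristic: true if every lead byte is followed by the right number of
// continuation bytes (10xxxxxx). Pure ASCII also passes.
bool looks_like_utf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (lead <= 0x7F)                      extra = 0;
        else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
        else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
        else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
        else return false; // stray continuation byte or a byte never valid as a lead

        if (bytes.size() - i - 1 < extra) return false; // sequence cut off at the end

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

A codepage 1252 string such as "café" (the é stored as the single byte E9) fails the check, while the same word stored as UTF-8 (C3 A9) passes.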


Answer 1
score 3, accepted, 2012-07-25 09:50:57

Answer 2
score 3, 2012-07-26 01:16:06

Answer 3
score 0, 2012-07-26 01:12:33

Answer 4
score 0, 2020-06-16 10:29:45
