C++ / wcout / UTF-8

Question

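I am reading a UTF-8 encoded Unicode text file and writing it to the console, but the characters displayed are not the same as the ones in the text editor I used to create the file. Here is my code: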

#define UNICODE

#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>

#include "pugixml.hpp"

using std::ifstream;
using std::ios;
using std::string;
using std::wstring;
using std::wcout;

int main( int argc, char * argv[] )
{
    ifstream oFile;

    try
    {
        string sContent;

        oFile.open ( "../config-sample.xml", ios::in );

        if( oFile.is_open() )
        {
            wchar_t wsBuffer[128];

            while( oFile.good() )
            {
                oFile >> sContent;
                mbstowcs( wsBuffer, sContent.c_str(), sizeof( wsBuffer ) );
              //wprintf( wsBuffer );// Same result as wcout.
                wcout << wsBuffer;
            }

            Sleep(100000);
        }
        else
        {
            throw L"Failed to open file";
        }
    }
    catch( const wchar_t * pwsMsg )
    {
        ::MessageBox( NULL, pwsMsg, L"Error", MB_OK | MB_TOPMOST | MB_SETFOREGROUND );
    }

    if( oFile.is_open() )
    {
        oFile.close();
    }

    return 0;
}

There must be something about encodings that I don't understand.

Answer 1

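Wide strings do not mean UTF-8. In fact it is the opposite: UTF-8 stands for Unicode Transformation Format (8-bit); it is a way of representing Unicode with 8-bit characters, that is, your plain chars. You should read the file into a plain string (not a wide string).

Wide strings use wchar_t, which is 16 bits on Windows. The OS uses UTF-16 for its "wide" functions.

On Windows, a UTF-8 string can be converted to UTF-16 with MultiByteToWideChar.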
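To make the difference between UTF-8 bytes and UTF-16 code units concrete, here is a minimal, portable sketch of the conversion itself. It is a hand-written decoder for illustration only, not the Windows API; the name DecodeUtf8ToUtf16 is hypothetical, and std::u16string stands in for a Windows wide string.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration; not part of any library.
// Decodes UTF-8 bytes (plain chars) into UTF-16 code units.
std::u16string DecodeUtf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells us how many bytes the sequence has.
        if (lead < 0x80)                        { cp = lead;        len = 1; }
        else if (lead >= 0xC2 && lead <= 0xDF)  { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0)         { cp = lead & 0x0F; len = 3; }
        else if (lead >= 0xF0 && lead <= 0xF4)  { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("invalid UTF-8 code point");
        i += len;

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Everything else: a surrogate pair (two 16-bit units).
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A three-byte UTF-8 sequence such as "\xE4\xB8\xAD" becomes a single UTF-16 unit, which is why treating the raw bytes as if they were already wide characters produces the wrong glyphs.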

Answer 2

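The problem is that mbstowcs does not actually use UTF-8. It uses an older style of "multibyte code points", which is not compatible with UTF-8 (although it is technically [I believe] possible to define a UTF-8 code page, there is no such thing in Windows).

If you want to convert UTF-8 to UTF-16, you can use MultiByteToWideChar with CP_UTF8 as the codepage.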
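A minimal sketch of that call follows, using the usual two-step pattern: ask for the required length first, then convert. The helper name Utf8ToWide is hypothetical. The non-Windows branch is only an assumption added so the sketch also compiles elsewhere; it uses the deprecated std::wstring_convert and is not part of the answer.

```cpp
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <codecvt>
#include <locale>
#endif

// Hypothetical helper: converts UTF-8 bytes into a wide string.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

#ifdef _WIN32
    const int srcLen = static_cast<int>(utf8.size());

    // First call: no output buffer, returns the number of wchar_t needed.
    const int needed = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                             utf8.data(), srcLen, nullptr, 0);
    if (needed <= 0)
        throw std::runtime_error("input is not valid UTF-8");

    // Second call: convert into a buffer of exactly that size.
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    const int written = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                              utf8.data(), srcLen,
                                              &wide[0], needed);
    if (written != needed)
        throw std::runtime_error("UTF-8 conversion failed");
    return wide;
#else
    // Assumption for portability only: wchar_t is 32 bits on most
    // non-Windows systems, so this yields UTF-32 rather than UTF-16.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on bad input
#endif
}
```

In the question's loop this would replace the mbstowcs call, for example `wcout << Utf8ToWide(sContent);`, and it also removes the fixed 128-element buffer. Whether the console then shows the right glyphs can still depend on the console's font and output mode, which this sketch does not address.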

Answer 3

I made a C++ char_t container that holds up to 6 8-bit char_t values and stores them in a std::vector. It can be converted to and from wchar_t, or appended to a std::string.

See it here: the UTF-8_String struct on GitHub

#include "UTF-8_String.h" //header from github link above

iBS::u8str  raw_v;
iBS::readu8file("TestUTF-8File.txt",raw_v);
std::cout<<raw_v.str()<<std::endl;

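The functions that convert wchar_t to uint32_t live in the u8char struct in the header above; here is one of them.

    #include <climits>  // MB_LEN_MAX
    #include <cwchar>

    u8char& operator=(wchar_t& wc)
    {
        char temp[MB_LEN_MAX];
        std::mbstate_t state{};  // conversion state must start zero-initialized
        // wcrtomb encodes using the current C locale, so the bytes are
        // UTF-8 only if that locale is a UTF-8 locale.
        std::size_t ret = std::wcrtomb(temp, wc, &state);
        if (ret == static_cast<std::size_t>(-1))
            ret = 0;  // wc cannot be represented in the current locale
        ref.resize(ret);
        for (std::size_t i = 0; i < ret; ++i)
            ref[i] = temp[i];
        return *this;
    }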
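Because wcrtomb depends on the active locale, a locale-independent alternative is to encode the code point by hand. The following is a minimal sketch under that assumption; EncodeUtf8 is a hypothetical name and is not taken from the UTF-8_String header. Note that current UTF-8 (RFC 3629) uses at most 4 bytes per code point, not 6.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
std::string EncodeUtf8(char32_t cp)
{
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

The returned std::string can be appended to another std::string directly, which matches the container's stated purpose without relying on setlocale.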

Answer 4

I found that wifstream works very well, even displaying UTF-8 words correctly in the Visual Studio debugger (I am reading Traditional Chinese words). This is from this post:

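#include <codecvt>   // deprecated since C++17, but still widely available
#include <fstream>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is an MSVC extension; std::locale() is portable.
    // The locale takes ownership of the facet, so the `new` does not leak.
    wif.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();  // decode the whole file into wide characters
    return wss.str();
}

// Usage
int main()
{
    std::wstring wstr2 = readFile("C:\\yourUtf8File.txt");
    std::wcout << wstr2;
}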

4 answers

Answer 1: 2, 2013-09-07 22:03:44
Answer 2: 2, accepted, 2013-09-07 22:12:36
Answer 3: 0, 2016-05-17 16:44:23
Answer 4: 0, 2021-04-15 15:28:11
