Writing UTF-16 to a file in binary mode

Question

I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

Opening test.txt in, for example, Firefox with the encoding set to UTF16, it shows up as:

h�e�l�l�o�

Could anyone tell me why this happens?

EDIT:

Opening the file in a hex editor I get:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00

It looks like I'm getting two extra bytes between every character for some reason?

Answer 1

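Here we run into the little-used locale properties. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion automatically.

N.B. This code does not take into account the endianness of the wchar_t characters.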

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // Construct a custom Unicode facet and add it to a locale.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB: If you open the file stream first, any attempt to imbue it with a locale will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}

File: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output arrays. Each wchar_t consumes
       // two input bytes, so stop once fewer than two bytes remain.
       for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object,
            * initialise it to zero out all of its bits. This
            * is important on systems with 32-bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into the
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size the
            * wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       // partial: input is left over (output buffer full, or a dangling odd byte).
       return((from < from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       // Each wchar_t produces two output bytes, so stop once fewer than two fit.
       for(;(from < from_end) && (to_limit - to >= 2);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       // partial: the output buffer filled up before all input was converted.
       return((from < from_end)?partial:ok);
   }
};

Answer 2

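It is easy if you use the C++11 standard (because it brings a lot of additional includes such as "utf8", which solve this problem for good).

But if you want multi-platform code that works with older standards, you can use this method to write with streams:

Read the article about the UTF converter for streams
Add stxutif.h from the sources above to your project

Open the file in ANSI mode and add the BOM to the start of the file, like this: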

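 std::ofstream fs;
 fs.open(filepath, std::ios::out|std::ios::binary);
 // UTF-8 byte order mark
 unsigned char smarker[3];
 smarker[0] = 0xEF;
 smarker[1] = 0xBB;
 smarker[2] = 0xBF;
 // smarker is not null-terminated, so use write() with an explicit length rather than operator<<
 fs.write(reinterpret_cast<const char*>(smarker), sizeof smarker);
 fs.close();

Then open the file as UTF and write your content there:

 std::wofstream fs;
 fs.open(filepath, std::ios::out|std::ios::app);
 std::locale utf8_locale(std::locale(), new utf8cvt<false>);
 fs.imbue(utf8_locale);
 fs << .. // Write anything you want...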

Answer 3

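I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it is writing out UTF-32/UCS-4 instead of UTF-16. That is certainly what the hex dump looks like.

That is easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure that is what is going on.

To go from a UTF-32 wstring to UTF-16 you will need to apply a proper encoding, because surrogate pairs come into play.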
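As an illustration of the encoding step mentioned above, here is a minimal sketch of a UTF-32 to UTF-16 conversion. The function name utf32_to_utf16 is hypothetical and not from the thread, and it assumes C++11 (std::u32string and std::u16string) so that the result does not depend on sizeof(wchar_t). Replacing invalid code points with U+FFFD is a choice made for this sketch, not something the answer prescribes.

```cpp
#include <string>

// Hypothetical helper (not from the thread): encode UTF-32 code points
// as UTF-16 code units, emitting surrogate pairs where needed.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        // Assumption of this sketch: invalid values become U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: split the 20-bit offset into a
            // high (leading) and a low (trailing) surrogate.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

On a platform where wchar_t is 4 bytes, the values of a wstring could be copied into a std::u32string first and then passed through this function.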

Answer 4

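On Windows, using wofstream with the utf16 facet defined above fails, because wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens regardless of how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with ofstream (not wofstream) in binary mode and write the output just as in the original post.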
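A minimal sketch of that binary-mode approach follows. The helper names utf16le_bytes and write_utf16le_file are hypothetical, and the sketch assumes the text is already held as 16-bit code units in a std::u16string (C++11), which sidesteps the 4-byte wchar_t problem from the question. Writing the FF FE byte order mark first is also an assumption of this sketch.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: serialise UTF-16 code units as little-endian bytes,
// preceded by the UTF-16LE byte order mark FF FE.
std::string utf16le_bytes(const std::u16string& text) {
    std::string bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes.push_back(static_cast<char>(0xFF));
    bytes.push_back(static_cast<char>(0xFE));
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    return bytes;
}

// Hypothetical helper: write the bytes through a binary-mode ofstream so
// that no newline translation (0A -> 0D 0A) is applied.
bool write_utf16le_file(const std::string& path, const std::u16string& text) {
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) {
        return false;
    }
    const std::string bytes = utf16le_bytes(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

A call such as write_utf16le_file("test.txt", u"hello\n") would then store the newline as the two bytes 0A 00.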

Answer 5

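The provided Utf16Facet didn't work in gcc for big strings; here is the version that worked for me. This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].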

#include <locale>   // std::codecvt; the libstdc++-internal <bits/codecvt.h> header is not needed


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

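       // Stop when fewer than two input bytes remain or the output buffer is full.
       for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
       {
           (*to)                               = L'\0';

           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       // Point at the first unconverted input byte and the next free output slot.
       from_next   = from;
       to_next     = to;

       // partial tells the stream to call again with the remaining input.
       return((from < from_end)?partial:ok);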
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

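       // Stop when the input is exhausted or fewer than two output bytes fit.
       for(;(from < from_end) && (to_limit - to >= 2);++from, to += 2)
       {
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];
       }
       // Point at the first unconverted input character and the next free output byte.
       from_next   = from;
       to_next     = to;

       // partial tells the stream to call again with the remaining input.
       return((from < from_end)?partial:ok);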
   }
};
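The facet above copies bytes out of the wchar_t object through reinterpret_cast, so which byte comes first depends on the host's byte order. As a hedged alternative sketch, the two bytes of a code unit can be computed with shifts, which gives the same result on any host. The names ByteOrder and utf16_unit_bytes are hypothetical and not part of this answer.

```cpp
#include <array>

// Hypothetical names for this sketch only.
enum class ByteOrder { Little, Big };

// Split one UTF-16 code unit into two bytes in the requested order.
// Shifts and masks work on the value, not on its memory layout, so the
// result does not depend on the endianness of the machine running the code.
std::array<unsigned char, 2> utf16_unit_bytes(char16_t unit, ByteOrder order) {
    const unsigned char lo = static_cast<unsigned char>(unit & 0xFF);
    const unsigned char hi = static_cast<unsigned char>((unit >> 8) & 0xFF);
    if (order == ByteOrder::Little) {
        return std::array<unsigned char, 2>{{lo, hi}};  // UTF-16LE: low byte first
    }
    return std::array<unsigned char, 2>{{hi, lo}};      // UTF-16BE: high byte first
}
```

For the letter h (U+0068), the little-endian order yields 68 00 and the big-endian order yields 00 68.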

Answer 6

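You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes, to verify that the output really is UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.

But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character (four for those encoded as surrogate pairs). But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally use just 1 byte per character.

When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that that worked.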
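If no hex editor is at hand, the same check can be approximated in code. The following is a minimal sketch with hypothetical helper names (read_file_bytes and hex_dump) that reads a file in binary mode and formats its bytes as space-separated hex pairs, in the same style as the dump in the question.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read the whole file as raw bytes (binary mode, no translation).
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: format bytes as upper-case hex pairs separated by spaces.
std::string hex_dump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (i != 0) {
            out.push_back(' ');
        }
        out += buf;
    }
    return out;
}
```

Calling hex_dump(read_file_bytes("test.txt")) gives a string that can be compared against the expected two bytes per character.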

6 answers

Answer 1: 14, 2008-10-16 12:56:58
Answer 2: 6, 2012-09-20 07:45:14
Answer 3: 6, accepted, 2008-10-16 07:47:34
Answer 4: 2
Answer 5: 1, 2012-06-09 02:59:52
Answer 6: 0, 2008-10-16 07:30:13
