Writing utf16 to file in binary mode

I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

Opening test.txt in, for example, Firefox with the encoding set to UTF16 shows:

h e l l o

Could anyone tell me why this happens?

EDIT:

Opening the file in a hex editor I get:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00 

It looks like I get two extra bytes between every character for some reason?

Here we run into the little-used locale properties. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion automatically.

NB: This code does not take into account the endianness of the wchar_t character.

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // Construct a custom Unicode facet and add it to a locale.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB: If you open the file stream first, any attempt to imbue it with a locale will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}    

The file: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output arrays.
       // Stop when fewer than two input bytes remain or the output is full.
       for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the 16 input bits may not fill the wchar_t object,
            * initialise it so that all its bits are zero. This
            * is important on systems with 32-bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into the
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size
            * the wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       // Input left over means the output filled up or a byte pair is incomplete.
       return((from < from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       // Stop when the input is exhausted or fewer than two output bytes remain.
       for(;(from < from_end) && (to_limit - to >= 2);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       // Unconverted input means the output buffer ran out of room.
       return((from < from_end)?partial:ok);
   }
};

It is easy if you use the C++11 standard (because it adds a lot of includes, like "utf8", that solve this problem for good).
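As a rough sketch of the C++11 route, assuming the facets meant here are the ones in the standard `<codecvt>` header (the answer does not name them): `std::codecvt_utf16` can be imbued on a `std::wofstream` so that the stream itself produces UTF-16 bytes, including surrogate pairs when `wchar_t` is 4 bytes wide. The helper name `write_utf16le` is made up for illustration, and `<codecvt>` has been deprecated since C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper (not from the thread): writes `text` to `path` as
// UTF-16LE, preceded by a byte order mark, using the C++11 <codecvt> facet.
bool write_utf16le(const char* path, const std::wstring& text)
{
    typedef std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian> Utf16LeFacet;

    std::wofstream out;
    // Imbue before opening so the facet is in place for the first write.
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(out.getloc(), new Utf16LeFacet));
    // Binary mode keeps Windows from expanding 0x0A bytes into 0x0D 0x0A.
    out.open(path, std::ios::out | std::ios::binary);
    if (!out)
        return false;

    out << wchar_t(0xFEFF);   // U+FEFF is encoded by the facet as FF FE
    out << text;
    out.close();
    return !out.fail();
}
```

Calling `write_utf16le("test.txt", L"hello")` should then give a file with two bytes per character after the BOM, regardless of `sizeof(wchar_t)`.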

But if you want to write multi-platform code with older standards, you can use this method to write with streams:

  1. Read the article about UTF converter for streams
  2. Add stxutif.h to your project from sources above
  3. Open the file in ANSI mode and add the BOM to the start of the file, like this:

     std::ofstream fs;
     fs.open(filepath, std::ios::out | std::ios::binary);
     const unsigned char smarker[3] = { 0xEF, 0xBB, 0xBF };   // UTF-8 byte order mark
     // write() emits exactly three bytes; operator<< would treat the array as a
     // NUL-terminated string and read past its end.
     fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));
     fs.close();

  4. Then open the file as UTF and write your content there:

     std::wofstream fs;
     fs.open(filepath, std::ios::out | std::ios::app);
     std::locale utf8_locale(std::locale(), new utf8cvt<false>);
     fs.imbue(utf8_locale);
     fs << .. // Write anything you want...

I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure it's what's going on.
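A small sketch of that check, with helper names invented for illustration: one function guesses the encoding from the width of `wchar_t`, the other computes how many bytes the `write()` call in the question emits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding a wstring most likely uses,
// judged only by the width of wchar_t on this platform.
std::string wchar_encoding_guess()
{
    const std::size_t bytes = sizeof(wchar_t);
    if (bytes == 2) return "UTF-16";   // typical on Windows
    if (bytes == 4) return "UTF-32";   // typical on Linux and macOS
    return "unknown";
}

// Number of bytes the question's write() call emits for `text`:
// length() counts wchar_t elements, not bytes.
std::size_t raw_byte_count(const std::wstring& text)
{
    return text.length() * sizeof(wchar_t);
}
```

With a 4-byte `wchar_t`, `raw_byte_count(L"hello")` is 20, which matches the four bytes per character visible in the hex dump.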

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.
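For illustration, here is a minimal sketch of that encoding step. The function name `to_utf16_units` is hypothetical, and it assumes each `wchar_t` holds one code point (the 4-byte case); values below 0x10000 are passed through unchanged, so a wstring that already holds UTF-16 is not damaged.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): turns the code points stored in
// a wstring into UTF-16 code units, splitting anything above U+FFFF into a
// high and a low surrogate.
std::vector<std::uint16_t> to_utf16_units(const std::wstring& text)
{
    std::vector<std::uint16_t> units;
    units.reserve(text.size());
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);
        if (cp > 0x10FFFF)
            cp = 0xFFFD;                       // out of range: use U+FFFD
        if (cp < 0x10000)
        {
            units.push_back(static_cast<std::uint16_t>(cp));
        }
        else
        {
            cp -= 0x10000;                     // 20 significant bits remain
            units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return units;
}
```

For example, U+1F600 becomes the pair D83D DE00, while 'h' stays the single unit 0068.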

On Windows, using wofstream with the UTF-16 facet defined above fails because the wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens irrespective of how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with an ofstream (not a wofstream) in binary mode and write the output just as it is done in the original post.
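A hedged sketch of the binary-mode approach follows. Instead of casting the wstring buffer as the original post does, it serialises 16-bit code units byte by byte, so the result does not depend on the host's byte order or on `sizeof(wchar_t)`. Both helper names and the use of `std::u16string` (C++11) are my own choices, not something from the thread.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: serialises UTF-16 code units low byte first
// (UTF-16LE), preceded by the FF FE byte order mark.
std::string utf16le_bytes(const std::u16string& units)
{
    std::string bytes;
    bytes.reserve(2 + units.size() * 2);
    bytes += '\xFF';
    bytes += '\xFE';
    for (std::u16string::size_type i = 0; i < units.size(); ++i)
    {
        bytes += static_cast<char>(units[i] & 0xFF);          // low byte
        bytes += static_cast<char>((units[i] >> 8) & 0xFF);   // high byte
    }
    return bytes;
}

// Hypothetical helper: writes the bytes through a narrow stream in binary
// mode, so a 0x0A byte is never expanded into 0x0D 0x0A.
bool write_binary(const char* path, const std::string& bytes)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return out.good();
}
```

Something like `write_binary("test.txt", utf16le_bytes(u"hello\n"))` would then store the newline as the two bytes 0A 00 on every platform.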

The provided UTF16Facet didn't work in gcc for big strings; here is the version that worked for me. This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].

#include <locale>   // declares std::codecvt; the libstdc++-internal <bits/codecvt.h> should not be included directly


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

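       // Convert only while two input bytes and one output slot remain.
       for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
       {
           (*to)                               = L'\0';

           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       // Report one past the last element actually converted.
       from_next   = from;
       to_next     = to;

       // Input left over means the output filled up or a byte pair is incomplete.
       return((from < from_end)?partial:ok);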
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

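       // Convert only while a wchar_t remains and two output bytes fit.
       for(;(from < from_end) && (to_limit - to >= 2);++from, to += 2)
       {
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];
       }
       // Report one past the last element actually converted.
       from_next   = from;
       to_next     = to;

       // Unconverted input means the output buffer ran out of room.
       return((from < from_end)?partial:ok);
   }

   /* Each wchar_t becomes two bytes. libstdc++ sizes its output buffer
    * from this value, so the inherited default can leave it too small. */
   virtual int do_max_length() const throw()
   {
       return 2;
   }
};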

You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes and verify that the output is actually UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.
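If no hex editor is at hand, a few lines of C++ can produce the same view. This is only a sketch; `hex_dump` is a name I made up, and it simply returns every byte of the file as an uppercase hex pair.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns the bytes of the file at `path` as
// space-separated uppercase hex pairs, in file order.
std::string hex_dump(const char* path)
{
    static const char digits[] = "0123456789ABCDEF";
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::string dump;
    char c;
    while (in.get(c))
    {
        const unsigned char byte = static_cast<unsigned char>(c);
        if (!dump.empty())
            dump += ' ';
        dump += digits[byte >> 4];      // high nibble
        dump += digits[byte & 0x0F];    // low nibble
    }
    return dump;
}
```

Printing `hex_dump("test.txt")` shows at a glance whether each character occupies two bytes (UTF-16) or four (UTF-32).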

But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character (four for characters outside the Basic Multilingual Plane). But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your file as UTF-8 or ASCII, which generally use just 1 byte per character.

When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that that would work.
