Writing UTF-16 to a file in binary mode
I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:
ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();
Opening test.txt in, for example, Firefox with the encoding set to UTF-16 shows:
h e l l o
Could anyone tell me why this happens?
EDIT:
Opening the file in a hex editor, I get:
FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00
Looks like I get two extra bytes in between every character for some reason?
Here we run into the little-used locale properties. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion automatically.
NB: This code does not take into account the endianness of the wchar_t character.
#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"
int main(int argc,char* argv[])
{
// Construct a custom Unicode facet and add it to a locale.
UTF16Facet *unicodeFacet = new UTF16Facet();
const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);
// Create a stream and imbue it with the facet
std::wofstream saveFile;
saveFile.imbue(unicodeLocale);
// Now the stream is imbued we can open it.
// NB: If you open the file stream first, any attempt to imbue it with a locale will silently fail.
saveFile.open("output.uni");
saveFile << L"This is my Data\n";
return(0);
}
The file UTF16Facet.h:
#include <locale>
class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
typedef MyType::state_type state_type;
typedef MyType::result result;
/* This function deals with converting data from the input stream into the internal stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_in(state_type &s,
const char *from,const char *from_end,const char* &from_next,
wchar_t *to, wchar_t *to_limit,wchar_t* &to_next) const
{
// Loop over both the input and output arrays; stop if fewer than two input bytes remain.
for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
{
/*Input the Data*/
/* As the input 16 bits may not fill the wchar_t object,
* initialise it to zero so that all of its bits are cleared. This
* is important on systems with 32-bit wchar_t objects.
*/
(*to) = L'\0';
/* Next read the data from the input stream into
* wchar_t object. Remember that we need to copy
* into the bottom 16 bits no matter what size the
* the wchar_t object is.
*/
reinterpret_cast<char*>(to)[0] = from[0];
reinterpret_cast<char*>(to)[1] = from[1];
}
from_next = from;
to_next = to;
return((from < from_end)?partial:ok); // partial: input is left over (output full or an odd trailing byte)
}
/* This function deals with converting data from the internal stream to a C/C++ file stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_out(state_type &state,
const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
char *to, char *to_limit, char* &to_next) const
{
for(;(from < from_end) && (to_limit - to >= 2);++from,to += 2) // each character needs two free output bytes
{
/* Output the Data */
/* NB I am assuming the characters are encoded as UTF-16.
* This means they are 16 bits inside a wchar_t object.
* As the size of wchar_t varies between platforms I need
* to take this into consideration and only take the bottom
* 16 bits of each wchar_t object.
*/
to[0] = reinterpret_cast<const char*>(from)[0];
to[1] = reinterpret_cast<const char*>(from)[1];
}
from_next = from;
to_next = to;
return((from < from_end)?partial:ok); // partial: the output buffer filled up before all input was converted
}
};
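As the NB above says, the facet copies bytes through reinterpret_cast and so depends on the byte order of the machine it runs on. One way around that, sketched here under the assumption that each wchar_t holds a single UTF-16 code unit in its low 16 bits, is to split the value with shifts. The names ByteOrder, encodeUnit and decodeUnit are made up for illustration and are not part of the original facet.

```cpp
#include <cassert>

enum class ByteOrder { Little, Big };  // hypothetical names, for illustration only

// Store the low 16 bits of wc into to[0] and to[1] in the requested byte order.
// Shifts work on the value, so the result does not depend on host endianness.
inline void encodeUnit(wchar_t wc, char* to, ByteOrder order)
{
    assert(to != nullptr);
    const unsigned long value = static_cast<unsigned long>(wc) & 0xFFFFul;
    const char lo = static_cast<char>(value & 0xFF);
    const char hi = static_cast<char>((value >> 8) & 0xFF);
    to[0] = (order == ByteOrder::Little) ? lo : hi;
    to[1] = (order == ByteOrder::Little) ? hi : lo;
}

// Rebuild a 16-bit code unit from two bytes stored in the given byte order.
inline wchar_t decodeUnit(const char* from, ByteOrder order)
{
    assert(from != nullptr);
    const unsigned lo = static_cast<unsigned char>(from[order == ByteOrder::Little ? 0 : 1]);
    const unsigned hi = static_cast<unsigned char>(from[order == ByteOrder::Little ? 1 : 0]);
    return static_cast<wchar_t>((hi << 8) | lo);
}
```

Inside do_out, a call such as encodeUnit(*from, to, ByteOrder::Little) could stand in for the two reinterpret_cast assignments, and decodeUnit could do the same job in do_in.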
It is easy if you use the C++11 standard (because it adds a lot of additional includes, such as "utf8", that solve this problem for good).
But if you want to use multi-platform code with older standards, you can use this method to write with streams:
Add stxutif.h to your project from sources above.
Open the file in ANSI mode and add the BOM to the start of the file, like this:
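std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);
unsigned char smarker[3];
smarker[0] = 0xEF;
smarker[1] = 0xBB;
smarker[2] = 0xBF;
// write() emits exactly three bytes; operator<< would treat the array as a null-terminated string.
fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));
fs.close();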
Then open the file as UTF-8 and write your content there:
std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);
std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale);
fs << .. // Write anything you want...
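The three marker bytes written above, EF BB BF, are the UTF-8 encoding of the code point U+FEFF. To show what a UTF-8 converter does for each character, here is a minimal sketch of the encoding rule. appendUtf8 is a hypothetical helper written for this illustration; it is not taken from stxutif.h and does not claim to match how utf8cvt is implemented.

```cpp
#include <cassert>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to out.
void appendUtf8(std::string& out, char32_t cp)
{
    // Assumption: the caller passes a valid code point (no surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling appendUtf8(s, 0xFEFF) on an empty string yields the same three bytes as the smarker array.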
I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.
That's easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure it's what's going on.
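A minimal sketch of that check follows. The helper names rawWstringEncoding and rawByteCount are invented for this example, and the code assumes wchar_t is either 2 or 4 bytes wide, which covers the common platforms but is not guaranteed by the standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Name the encoding a raw dump of a wstring most plausibly has,
// judged only by the width of wchar_t.
std::string rawWstringEncoding()
{
    const std::size_t width = sizeof(wchar_t);
    assert(width == 2 || width == 4);  // assumption: other widths are not handled
    return (width == 2) ? "UTF-16" : "UTF-32";
}

// Number of bytes the write() call in the question emits for s.
std::size_t rawByteCount(const std::wstring& s)
{
    return s.length() * sizeof(wchar_t);
}
```

With a 4-byte wchar_t, rawByteCount(L"hello") is 20, which matches the four bytes per character in the hex dump (the dump also starts with the two bytes FF FE).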
To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.
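That conversion can be sketched as below. toUtf16 is a hypothetical name, and the code assumes every element of the wstring is a valid Unicode code point. On a platform where wchar_t is already 16 bits wide, the elements pass through unchanged.

```cpp
#include <cassert>
#include <string>

// Convert code points held in a wstring to UTF-16 code units.
std::u16string toUtf16(const std::wstring& in)
{
    std::u16string out;
    for (wchar_t wc : in) {
        char32_t cp = static_cast<char32_t>(wc);
        assert(cp <= 0x10FFFF);  // assumption: input holds valid code points
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one code unit, same value.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: split the remaining 20 bits into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
        }
    }
    return out;
}
```

The resulting code units still have to be written out byte by byte in a chosen byte order; dumping the u16string memory directly would again depend on the host's endianness.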
On Windows, using wofstream with the UTF-16 facet defined above fails because the wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens irrespective of how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with an ofstream (not a wofstream) in binary mode and write the output just as is done in the original post.
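A minimal sketch of that binary-mode approach, assuming the text is already available as UTF-16 code units. writeUtf16LE and readAll are hypothetical helpers, not code from the thread. The bytes are emitted with shifts, so the file is little-endian whatever the host byte order is.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Write UTF-16 code units as little-endian bytes, preceded by the FF FE byte-order mark.
bool writeUtf16LE(const std::string& path, const std::u16string& text)
{
    assert(!path.empty());
    // Binary mode: the runtime leaves 0A bytes alone instead of expanding them to 0D 0A.
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out.put('\xFF').put('\xFE');  // BOM for UTF-16LE
    for (char16_t unit : text) {
        out.put(static_cast<char>(unit & 0xFF));         // low byte first
        out.put(static_cast<char>((unit >> 8) & 0xFF));  // then the high byte
    }
    return static_cast<bool>(out);
}

// Read a whole file back as raw bytes, to check what was actually written.
std::string readAll(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}
```

Writing u"a\n" this way should produce the six bytes FF FE 61 00 0A 00, with no 0D inserted before the 0A.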
The provided Utf16Facet didn't work in gcc for big strings; here is the version that worked for me. This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].
#include <locale>
// <locale> already declares std::codecvt; the libstdc++-internal <bits/codecvt.h> does not need to be included directly.
class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
typedef MyType::state_type state_type;
typedef MyType::result result;
/* This function deals with converting data from the input stream into the internal stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_in(state_type &s,
const char *from,const char *from_end,const char* &from_next,
wchar_t *to, wchar_t *to_limit,wchar_t* &to_next) const
{
for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
{
(*to) = L'\0';
reinterpret_cast<char*>(to)[0] = from[0];
reinterpret_cast<char*>(to)[1] = from[1];
}
// Report the first unconverted input byte and the next free output slot.
from_next = from;
to_next = to;
return((from < from_end)?partial:ok);
}
/* This function deals with converting data from the internal stream to a C/C++ file stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_out(state_type &state,
const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
char *to, char *to_limit, char* &to_next) const
{
for(;(from < from_end) && (to_limit - to >= 2);++from, to += 2)
{
to[0] = reinterpret_cast<const char*>(from)[0];
to[1] = reinterpret_cast<const char*>(from)[1];
}
// Report the first unconverted character and the next free output byte.
from_next = from;
to_next = to;
return((from < from_end)?partial:ok);
}
};
You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes and verify that the output is actually UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.
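If no hex editor is at hand, a few lines of C++ give the same view of the bytes. This is only a sketch; hexDump is a hypothetical helper that formats bytes already read from the file in binary mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Format raw bytes as space-separated two-digit hex values, as a hex editor shows them.
std::string hexDump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes above 7F are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    // Each byte contributes two digits plus a separator, except the last.
    assert(bytes.empty() || out.size() == bytes.size() * 3 - 1);
    return out;
}
```

In UTF-16LE output each ASCII character shows up as its byte followed by a single 00; three 00 bytes after each character point to UTF-32 instead.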
But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character (four for characters outside the Basic Multilingual Plane). But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally have just 1 byte per character.
When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that that would work.