Writing utf16 to file in binary mode
I'm trying to write a wstring to a file with an ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:
ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();
Opening test.txt in, for example, Firefox with the encoding set to UTF16, it shows up as:
h�e�l�l�o�
Could anyone tell me why this happens?
EDIT:
Opening the file in a hex editor I get:
FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00
It looks like I get two extra bytes between every character for some reason?
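Here we run into the little-used locale properties. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion automatically.
N.B. This code does not take into account the endianness of the wchar_t characters.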
#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"
int main(int argc,char* argv[])
{
    // Construct a custom unicode facet and add it to a locale.
    UTF16Facet *unicodeFacet = new UTF16Facet();
    const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);
    // Create a stream and imbue it with the facet
    std::wofstream saveFile;
    saveFile.imbue(unicodeLocale);
    // Now the stream is imbued we can open it.
    // NB If you open the file stream first, any attempt to imbue it with a locale will silently fail.
    saveFile.open("output.uni");
    saveFile << L"This is my Data\n";
    return(0);
}
File: UTF16Facet.h
#include <locale>
class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
    typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
    typedef MyType::state_type state_type;
    typedef MyType::result result;
    /* This function deals with converting data from the input stream into the internal stream.*/
    /*
     * from, from_end: Points to the beginning and end of the input that we are converting 'from'.
     * to, to_limit:   Points to where we are writing the conversion 'to'
     * from_next:      When the function exits this should have been updated to point at the next location
     *                 to read from. (ie the first unconverted input character)
     * to_next:        When the function exits this should have been updated to point at the next location
     *                 to write to.
     *
     * status:         This indicates the status of the conversion.
     *                 possible values are:
     *                 error:   An error occurred; the bad file bit will be set.
     *                 ok:      Everything went to plan
     *                 partial: Not enough input data was supplied to complete any conversion.
     *                 nonconv: no conversion was done.
     */
    virtual result do_in(state_type &s,
                         const char *from,const char *from_end,const char* &from_next,
                         wchar_t *to, wchar_t *to_limit,wchar_t* &to_next) const
    {
        // Loop over both the input and output array
        for(;(from < from_end) && (to < to_limit);from += 2,++to)
        {
            /*Input the Data*/
            /* As the input 16 bits may not fill the wchar_t object,
             * initialise it so that all its bits are zeroed out. This
             * is important on systems with 32bit wchar_t objects.
             */
            (*to) = L'\0';
            /* Next read the data from the input stream into the
             * wchar_t object. Remember that we need to copy
             * into the bottom 16 bits no matter what size
             * the wchar_t object is.
             */
            reinterpret_cast<char*>(to)[0] = from[0];
            reinterpret_cast<char*>(to)[1] = from[1];
        }
        from_next = from;
        to_next = to;
        return((from > from_end)?partial:ok);
    }
    /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
    /*
     * from, from_end: Points to the beginning and end of the input that we are converting 'from'.
     * to, to_limit:   Points to where we are writing the conversion 'to'
     * from_next:      When the function exits this should have been updated to point at the next location
     *                 to read from. (ie the first unconverted input character)
     * to_next:        When the function exits this should have been updated to point at the next location
     *                 to write to.
     *
     * status:         This indicates the status of the conversion.
     *                 possible values are:
     *                 error:   An error occurred; the bad file bit will be set.
     *                 ok:      Everything went to plan
     *                 partial: Not enough input data was supplied to complete any conversion.
     *                 nonconv: no conversion was done.
     */
    virtual result do_out(state_type &state,
                          const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                          char *to, char *to_limit, char* &to_next) const
    {
        for(;(from < from_end) && (to < to_limit);++from,to += 2)
        {
            /* Output the Data */
            /* NB I am assuming the characters are encoded as UTF-16.
             * This means they are 16 bits inside a wchar_t object.
             * As the size of wchar_t varies between platforms I need
             * to take this into consideration and only take the bottom
             * 16 bits of each wchar_t object.
             */
            to[0] = reinterpret_cast<const char*>(from)[0];
            to[1] = reinterpret_cast<const char*>(from)[1];
        }
        from_next = from;
        to_next = to;
        return((to > to_limit)?partial:ok);
    }
};
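It is easy if you use the C++11 standard (because there are a lot of additional includes, like "utf8", which solve this problem for good).
But if you want multi-platform code that works with older standards, you can use this method to write with streams: add stxutif.h to your project, open the file in ANSI mode and add the BOM to the start of the file, like this: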
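std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);
const unsigned char smarker[3] = { 0xEF, 0xBB, 0xBF }; // UTF-8 BOM
// smarker is not null-terminated, so write it with an explicit length instead of <<
fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));
fs.close();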
Then reopen the file as UTF-8 (in append mode) and write your content there:
std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);
std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale);
fs << .. // Write anything you want...
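For the C++11 route mentioned at the start of this answer, a minimal sketch might look like the following. The answer does not name the header it has in mind; this assumes the standard <codecvt> facets together with std::wstring_convert, and targets UTF-16LE because that is what the question asks for. The function names utf16le_bytes and write_utf16le_cpp11 are made up for illustration. Note that <codecvt> was deprecated in C++17, so newer compilers may warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: let the C++11 facet turn a wstring into UTF-16LE bytes.
// to_bytes() throws std::range_error if the conversion fails.
std::string utf16le_bytes(const std::wstring& text)
{
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian> > conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: write a BOM followed by the converted bytes.
// Binary mode keeps the stream from touching any 0A bytes.
bool write_utf16le_cpp11(const char* path, const std::wstring& text)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out.write("\xFF\xFE", 2); // UTF-16LE byte order mark
    const std::string bytes = utf16le_bytes(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();
    return !out.fail();
}
```

Calling write_utf16le_cpp11("test.txt", L"hello") should then produce a 12-byte file: the two BOM bytes followed by two bytes per character.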
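I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it is writing out UTF-32/UCS-4 instead of UTF-16. That is certainly what the hex dump looks like.
That's easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure it's what's going on.
To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.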
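A minimal sketch of that conversion, written by hand so it does not depend on any particular library. It assumes the wstring holds UTF-32 code points when wchar_t is 32 bits wide and UTF-16 code units when it is 16 bits wide; lone surrogates are not validated. The names to_utf16le and write_utf16le_file are hypothetical.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encode a wstring as UTF-16LE bytes,
// independent of sizeof(wchar_t) and of the host byte order.
std::string to_utf16le(const std::wstring& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    auto put16 = [&out](std::uint32_t unit) {
        out.push_back(static_cast<char>(unit & 0xFF));        // low byte first
        out.push_back(static_cast<char>((unit >> 8) & 0xFF)); // then high byte
    };
    for (wchar_t wc : in) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp > 0x10FFFF) {
            put16(0xFFFD); // not a Unicode code point: emit U+FFFD instead
        } else if (cp >= 0x10000) {
            // Only reachable with a 32-bit wchar_t: split into a surrogate pair.
            cp -= 0x10000;
            put16(0xD800 + (cp >> 10));
            put16(0xDC00 + (cp & 0x3FF));
        } else {
            put16(cp); // BMP code point, or an existing UTF-16 code unit
        }
    }
    return out;
}

// Hypothetical helper: binary-mode ofstream, as in the question, plus a BOM.
bool write_utf16le_file(const char* path, const std::wstring& text)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out.write("\xFF\xFE", 2); // UTF-16LE byte order mark
    const std::string bytes = to_utf16le(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();
    return !out.fail();
}
```

With this, L"hello" comes out as 68 00 65 00 6C 00 6C 00 6F 00 whether wchar_t is 2 or 4 bytes wide, instead of the 4-byte-per-character layout shown in the question's hex dump.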
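On Windows, using wofstream and the utf16 facet defined above fails, because the wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens irrespective of how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with an ofstream (not a wofstream) in binary mode and write the output just as it is done in the original post.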
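A small sketch of that approach, assuming the text is already available as UTF-16 code units. It swaps std::wstring for std::u16string so that each unit is 16 bits wide on every platform, which is a substitution of mine and not something the post above uses. The helper names write_raw_utf16 and read_raw_bytes are hypothetical, and the bytes land in the host's byte order (little-endian on typical Windows machines).

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: copy the code units to disk byte for byte.
// std::ios::binary stops the stream from expanding 0A into 0D 0A.
bool write_raw_utf16(const char* path, const std::u16string& text)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(char16_t)));
    out.close();
    return !out.fail();
}

// Hypothetical helper: read the file back unmodified, for checking the result.
std::string read_raw_bytes(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

Writing u"a\nb" this way should give exactly six bytes on disk, with the newline stored as a single 16-bit unit and no 0D byte inserted.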
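The provided Utf16Facet did not work in gcc for big strings; here is the version that worked for me. This way the file is saved as UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].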
#include <locale> // std::codecvt is declared here; the internal <bits/codecvt.h> header is not needed
class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
    typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
    typedef MyType::state_type state_type;
    typedef MyType::result result;
    /* This function deals with converting data from the input stream into the internal stream.*/
    /*
     * from, from_end: Points to the beginning and end of the input that we are converting 'from'.
     * to, to_limit:   Points to where we are writing the conversion 'to'
     * from_next:      When the function exits this should have been updated to point at the next location
     *                 to read from. (ie the first unconverted input character)
     * to_next:        When the function exits this should have been updated to point at the next location
     *                 to write to.
     *
     * status:         This indicates the status of the conversion.
     *                 possible values are:
     *                 error:   An error occurred; the bad file bit will be set.
     *                 ok:      Everything went to plan
     *                 partial: Not enough input data was supplied to complete any conversion.
     *                 nonconv: no conversion was done.
     */
    virtual result do_in(state_type &s,
                         const char *from,const char *from_end,const char* &from_next,
                         wchar_t *to, wchar_t *to_limit,wchar_t* &to_next) const
    {
        for(;from < from_end;from += 2,++to)
        {
            if(to<=to_limit){
                (*to) = L'\0';
                reinterpret_cast<char*>(to)[0] = from[0];
                reinterpret_cast<char*>(to)[1] = from[1];
                from_next = from;
                to_next = to;
            }
        }
        return((to != to_limit)?partial:ok);
    }
    /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
    /*
     * from, from_end: Points to the beginning and end of the input that we are converting 'from'.
     * to, to_limit:   Points to where we are writing the conversion 'to'
     * from_next:      When the function exits this should have been updated to point at the next location
     *                 to read from. (ie the first unconverted input character)
     * to_next:        When the function exits this should have been updated to point at the next location
     *                 to write to.
     *
     * status:         This indicates the status of the conversion.
     *                 possible values are:
     *                 error:   An error occurred; the bad file bit will be set.
     *                 ok:      Everything went to plan
     *                 partial: Not enough input data was supplied to complete any conversion.
     *                 nonconv: no conversion was done.
     */
    virtual result do_out(state_type &state,
                          const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                          char *to, char *to_limit, char* &to_next) const
    {
        for(;(from < from_end);++from, to += 2)
        {
            if(to <= to_limit){
                to[0] = reinterpret_cast<const char*>(from)[0];
                to[1] = reinterpret_cast<const char*>(from)[1];
                from_next = from;
                to_next = to;
            }
        }
        return((to != to_limit)?partial:ok);
    }
};
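The byte swap described above for UTF-16BE can also be expressed with shifts, which has the side benefit of not depending on the byte order of the machine running the code (the reinterpret_cast approach copies whatever layout the host uses). This is only a sketch; append_utf16_unit and swap_utf16_byte_order are hypothetical names, not part of the facet.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical helper: append one 16-bit code unit in the requested byte order.
void append_utf16_unit(std::string& out, std::uint16_t unit, bool big_endian)
{
    const char hi = static_cast<char>((unit >> 8) & 0xFF);
    const char lo = static_cast<char>(unit & 0xFF);
    if (big_endian) {
        out.push_back(hi); // UTF-16BE: high byte first
        out.push_back(lo);
    } else {
        out.push_back(lo); // UTF-16LE: low byte first
        out.push_back(hi);
    }
}

// Hypothetical helper: convert an already encoded buffer between LE and BE
// by swapping the two bytes of every code unit. A trailing odd byte is left alone.
std::string swap_utf16_byte_order(std::string bytes)
{
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        std::swap(bytes[i], bytes[i + 1]);
    }
    return bytes;
}
```

For the letter h (U+0068) this gives 68 00 in little-endian order and 00 68 in big-endian order.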
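You should look at the output file in a hex editor such as WinHex, so you can see the actual bits and bytes and verify that the output really is UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.
But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character (four for characters outside the Basic Multilingual Plane). But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally use just 1 byte per character.
When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that this worked.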
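If no hex editor is at hand, the same check can be done from C++ with a few lines. This is a rough sketch; hex_dump and hex_dump_file are hypothetical helper names.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: format raw bytes as "FF FE 68 00 ...".
std::string hex_dump(const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(digits[b >> 4]);   // high nibble
        out.push_back(digits[b & 0x0F]); // low nibble
    }
    return out;
}

// Hypothetical helper: read a whole file in binary mode and dump it.
// A file that cannot be opened yields an empty string.
std::string hex_dump_file(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return hex_dump(bytes);
}
```

Printing hex_dump_file("test.txt") after running the program from the question shows at a glance whether each character takes two bytes or four.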