C++ & Boost: encode/decode UTF-8

I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF-8 bytes, and then the reverse: take a string containing UTF-8 bytes and convert it to a Unicode-aware wstring.

The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with

Trying to convert the code to use stringstream / wstringstream instead of files or whatever, but nothing seems to work.

For instance, in Python it would look like so:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}

I really don't want to add another dependency on ICU or something in that spirit... but to my understanding, it should be possible with Boost.

Some sample code would be greatly appreciated! Thanks

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/, a header-only library that's very lightweight and easy to use. I'm sharing some demo code here, should anyone find it useful:

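#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // utfcpp

// Note: utf8to32 / utf32to8 assume wchar_t holds UTF-32 (e.g. Linux, macOS).
// Where wchar_t is 16 bits wide (Windows), use utf8::utf8to16 / utf8::utf16to8 instead.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}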

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
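
For anyone curious what a UTF-32 to UTF-8 conversion involves, here is a minimal hand-rolled sketch with no dependencies. It is not utfcpp's implementation: the names encode_utf8_sketch and decode_utf8_sketch are made up for illustration, it works on std::u32string rather than std::wstring to sidestep the wchar_t width question, and it skips checks a real library performs (overlong encodings, for instance).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encode Unicode code points (UTF-32) as UTF-8 bytes.
// Sketch only: rejects surrogates and values above U+10FFFF.
inline std::string encode_utf8_sketch(const std::u32string& code_points)
{
    std::string bytes;
    for (char32_t cp : code_points) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
            bytes += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            bytes += static_cast<char>(0xC0 | (cp >> 6));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            bytes += static_cast<char>(0xE0 | (cp >> 12));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes: 11110xxx followed by three 10xxxxxx
            bytes += static_cast<char>(0xF0 | (cp >> 18));
            bytes += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return bytes;
}

// Decode UTF-8 bytes into code points.
// Sketch only: does not reject overlong or out-of-range sequences.
inline std::u32string decode_utf8_sketch(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out += cp;
        i += len;
    }
    return out;
}
```

Calling encode_utf8_sketch(U"\u05e9\u05dc\u05d5\u05dd") should produce the same eight bytes as the Python example in the question. For production use, a tested library such as utfcpp is the safer choice.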

There's already a Boost link in the comments, but in the almost-standard C++0x (since published as C++11), there is wstring_convert, which does this:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:

true 
true
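
The same converter can be wrapped into the encode_utf8 / decode_utf8 pair from the question. This is only a sketch under a few assumptions: <codecvt> and std::wstring_convert are available (they have been deprecated since C++17, so expect warnings on newer compilers), and wchar_t is wide enough for the text being converted.

```cpp
#include <codecvt>  // deprecated since C++17, but still shipped by the major standard libraries
#include <locale>
#include <string>

// Helper names taken from the question; the bodies are an illustrative sketch.
// Caveat: std::codecvt_utf8<wchar_t> treats a 16-bit wchar_t (Windows) as UCS-2,
// so code points above U+FFFF only round-trip where wchar_t is 32 bits wide.
inline std::string encode_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

inline std::wstring decode_utf8(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);  // throws std::range_error on malformed input
}
```

With these helpers, string s = encode_utf8(ws); and wstring ws2 = decode_utf8(s); read exactly as in the question.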

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert from and to UTF-8/16.

Here are some convenient examples from the docs:

string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);

Almost as easy as Python encoding/decoding :)

Note that Boost.Locale is not a header-only library.

For a drop-in replacement for std::string / std::wstring that handles UTF-8, see TINYUTF8.

In combination with <codecvt>, you can convert between UTF-8 and the other Unicode encodings (UTF-16, UTF-32), and then handle the UTF-8 data through the above library.
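
As a rough illustration of the <codecvt> side, the sketch below converts between UTF-8 and UTF-16 using std::codecvt_utf8_utf16. The helper names utf8_to_utf16 and utf16_to_utf8 are made up for this example, and it says nothing about TINYUTF8's own API. Note that <codecvt> has been deprecated since C++17.

```cpp
#include <codecvt>  // deprecated since C++17
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units (surrogate pairs included).
inline std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on malformed input
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
inline std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Unlike codecvt_utf8<wchar_t>, this facet handles characters outside the Basic Multilingual Plane by emitting surrogate pairs, so the result does not depend on the platform's wchar_t width.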
