Reading UTF-8 text and converting to UTF-16 using standard C++ wifstream

I'd like to read some text from a file that uses UTF-8 encoding and convert it to UTF-16, using std::wifstream, something like this:

//
// Read UTF-8 text and convert to UTF-16
//
std::wifstream src;
src.imbue(std::locale("???"));          // UTF-8 ???
src.open("some_text_file_using_utf8");
std::wstring line;                      // UTF-16 string
while (std::getline(src, line))
{
    // ... process the UTF-16 string ...
}

Is there a standard locale name for the UTF-8 conversion?
Is it possible to achieve that goal using std::locale?

I'm using Visual Studio 2013.


NOTE:

I know that I/O streams tend to be slow, and that it's possible to use Win32 memory-mapped files for faster reading, the MultiByteToWideChar() Win32 API for the conversion, etc.
But for this particular case I'd like a solution that uses only standard C++ and its standard library, without Boost.

If the C++ standard library just can't do that, the second option would be to use Boost; in that case, which Boost library should I use?

This works on Windows with Visual Studio, I think as far back as VS2010:

#include <locale>  // std::locale
#include <codecvt> // std::codecvt_utf8_utf16, std::consume_header

src.imbue(std::locale(
    src.getloc(),
    new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

Since Windows uses a 16-bit wchar_t and also universally uses UTF-16 as the wide character encoding, this works great in that environment. (And because I'm assuming a Windows environment, my example includes consume_header to handle the Windows convention of adding a header, i.e. a byte order mark, to UTF-8 data.)
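To see the pieces working together, here is a minimal sketch of a complete reader built around that imbue call. The helper name read_utf8_lines is made up for illustration, and the sketch assumes a toolchain that still ships <codecvt> (the header has been deprecated since C++17, so expect warnings on newer compilers).

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16, std::consume_header
#include <fstream>  // std::wifstream
#include <locale>   // std::locale
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): read every line of a
// UTF-8 encoded file into wide strings holding UTF-16 code units.
std::vector<std::wstring> read_utf8_lines(const std::string& path)
{
    std::wifstream src;

    // Imbue before opening so the conversion facet is already in place when
    // the first bytes are read. The locale takes ownership of the facet.
    src.imbue(std::locale(
        src.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    src.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(src, line))
        lines.push_back(line);  // each element holds UTF-16 code units
    return lines;
}
```

Calling read_utf8_lines("some_text_file_using_utf8") returns an empty vector if the file cannot be opened; a real program would probably want to check src.is_open() and report the failure instead.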

On other platforms wchar_t is generally 32-bit and, while you can store UTF-16 code unit values in such 32-bit code units, nothing else will be written to expect such a thing. On a platform with a 32-bit wchar_t you might prefer to use std::codecvt_utf8<wchar_t> to produce UTF-32 wide strings.
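The difference between the two facets shows up with a character outside the Basic Multilingual Plane. The sketch below uses std::wstring_convert (a standard but, since C++17, deprecated helper that the answer itself does not use) purely to make the comparison short; both function names are hypothetical.

```cpp
#include <codecvt>  // std::codecvt_utf8, std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>

// Hypothetical helper: always yields UTF-16 code units, whatever the width
// of wchar_t. A non-BMP character becomes a surrogate pair (two elements).
std::wstring utf8_to_utf16_units(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: yields one element per code point. This gives UTF-32
// when wchar_t is 32-bit. With a 16-bit wchar_t the facet targets UCS-2, so
// a non-BMP character is expected to fail with std::range_error.
std::wstring utf8_to_code_points(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}
```

For U+1F600 (UTF-8 bytes F0 9F 98 80), the first function produces the two code units 0xD83D and 0xDE00, while the second produces the single value 0x1F600 on a platform with a 32-bit wchar_t.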


For portability, what you'd ideally want is a codecvt facet that knows how to convert from UTF-8 to either the locale's wchar_t encoding or the wide execution encoding. The problem with that, however, is that there's no requirement for any wide encoding to support the entire range of characters representable in UTF-8. The bottom line is that wchar_t, as specified, isn't particularly useful for portable code.

However, one trick that might be useful, if you're sticking to platforms whose wide encoding is UTF-16 or UTF-32 depending on the size of wchar_t, is:

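#include <climits> // CHAR_BIT
#include <codecvt> // std::codecvt_utf8, std::codecvt_utf8_utf16
#include <locale>  // std::locale

// The primary template is left undefined, so an unexpected wchar_t width fails to compile.
template <int N> struct get_codecvt_utf8_wchar_impl;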
template <> struct get_codecvt_utf8_wchar_impl<16> {
  using type = std::codecvt_utf8_utf16<wchar_t>;
};
template <> struct get_codecvt_utf8_wchar_impl<32> {
  using type = std::codecvt_utf8<wchar_t>;
};

using codecvt_utf8_wchar = get_codecvt_utf8_wchar_impl<
    sizeof(wchar_t) * CHAR_BIT>::type;

src.imbue(std::locale(src.getloc(), new codecvt_utf8_wchar));
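The same selection can also be written with std::conditional instead of hand-written specializations. This is only an alternative sketch of the trick above, not part of the original answer; the helper name imbue_utf8 is hypothetical, and the static_assert stands in for the undefined primary template.

```cpp
#include <climits>      // CHAR_BIT
#include <codecvt>      // std::codecvt_utf8, std::codecvt_utf8_utf16
#include <fstream>      // std::wifstream
#include <locale>       // std::locale
#include <type_traits>  // std::conditional

// Reject platforms whose wchar_t is neither 16-bit nor 32-bit.
static_assert(sizeof(wchar_t) * CHAR_BIT == 16 ||
              sizeof(wchar_t) * CHAR_BIT == 32,
              "this sketch assumes a 16-bit or 32-bit wchar_t");

// UTF-8 <-> UTF-16 when wchar_t is 16-bit, UTF-8 <-> UTF-32 otherwise.
using codecvt_utf8_wchar = std::conditional<
    sizeof(wchar_t) * CHAR_BIT == 16,
    std::codecvt_utf8_utf16<wchar_t>,
    std::codecvt_utf8<wchar_t>>::type;

// Hypothetical helper: call this before opening the stream.
void imbue_utf8(std::wifstream& stream)
{
    stream.imbue(std::locale(stream.getloc(), new codecvt_utf8_wchar));
}
```

With this in place, a character such as U+1F600 arrives as two elements of the line on a 16-bit wchar_t platform and as one element on a 32-bit wchar_t platform.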

You can also use char16_t and char32_t, which would lend themselves to portable code. However, the standard is missing a few bits needed to make iostreams usable with these character types, and implementations don't fully support what is specified.

I think VS still implements char16_t and char32_t as typedefs, so the template specializations using them don't work (the specializations do exist if you look in the headers; they're just ifdef'd out because the compiler can't handle them). libstdc++ doesn't implement the template specializations yet, even though it supports char16_t and char32_t as real types. The most complete implementation I know of is libc++ with a suitable compiler (gcc or clang), but even that is still missing the <cuchar> header.

Since implementation support is limited, portable code can't do much with these types beyond using them as a consistent representation in user code across platforms (though that is useful even on its own).
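As an illustration of using char16_t as that consistent in-memory representation, here is a minimal sketch. It assumes a toolchain where the char16_t codecvt specializations are actually available, which the paragraphs above note was not the case everywhere when this was written. The function names are hypothetical, and std::wstring_convert has been deprecated since C++17.

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units in a std::u16string.
// Throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_u16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: the reverse direction, for writing the text back out.
std::string u16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

The file itself can then be read with a plain std::ifstream and converted line by line, which sidesteps the missing iostream support for char16_t at the cost of an explicit conversion step.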
