
UTF8 to UTF16 conversion using std::filesystem::path

Starting with C++11, one can convert UTF-8 to UTF-16 stored in wchar_t (at least on Windows, where wchar_t is 16 bits wide) using std::wstring_convert with std::codecvt_utf8_utf16:

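#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>

std::wstring utf8ToWide( const char* utf8 )
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes( utf8 );
}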

Unfortunately, std::codecvt_utf8_utf16 is deprecated as of C++17 (as is std::wstring_convert). But std::filesystem::path has all the necessary conversions built in; e.g., it has these members (the std::u8string return type is C++20):

std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;
std::u16string u16string() const;
std::u32string u32string() const;

So the above function can be rewritten as follows:

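#include <filesystem>
#include <string>

std::wstring utf8ToWide( const char* utf8 )
{
    // char8_t requires C++20; the cast makes path treat the bytes as UTF-8
    return std::filesystem::path( reinterpret_cast<const char8_t*>( utf8 ) ).wstring();
}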

And unlike the std::codecvt_utf8_utf16 version, this does not use any deprecated part of C++.

What kind of drawbacks can be expected from such a converter? For example, is there a maximum length for a path, or are certain Unicode characters prohibited in it?

What kind of drawbacks can be expected from such a converter?

Well, let's get the most obvious drawback out of the way. For a user who doesn't know what you're doing, it makes no sense. Doing UTF-8-to-16 conversion by using a path type is bonkers, and should be seen immediately as a code smell. It's the kind of awful hack you do when you are needlessly averse to just downloading a simple library that would do it correctly.

Also, it doesn't have to work. path is meant for storing... paths. Hence the name. Specifically, it is meant for storing paths in a way easily consumed by the filesystem in question. As such, the string stored in a path can be subject to any limitations the filesystem wants to put on it, apart from the small set of things the C++ standard requires of it.

For example, if the filesystem is case-insensitive (or even just ASCII-case-insensitive), it is a legitimate implementation to have it just case-convert all strings to lowercase when they are stored in a path. Or to case-convert them when you extract them from a path. Or anything of the like.

path can convert all of your \s into /s. Or your :s into /s. Or any other implementation-dependent tricks it wants to do.
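Whether a given implementation actually alters the text can be checked empirically. The following is only a sketch: survivesPathRoundTrip is a hypothetical helper name, not a standard facility, and its result is implementation-dependent by design. A true result on one platform says nothing about another.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not part of the standard library): reports whether
// a wide string comes back unchanged after being stored in a
// std::filesystem::path. The answer may differ between implementations.
bool survivesPathRoundTrip( const std::wstring& text )
{
    const std::filesystem::path p( text );
    return p.wstring() == text;
}
```

Calling it with strings that contain \, / or : on each platform you target shows whether the path-based conversion is lossless for your inputs there.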

If you're afraid of using a deprecated facility, just download a simple UTF-8/16 converting library. Or write one yourself; it isn't that difficult.
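To give an idea of what writing one yourself involves, here is a minimal sketch of a hand-written converter. The name utf8ToUtf16 and the choice to throw std::invalid_argument on malformed input are assumptions made for illustration; it returns std::u16string rather than std::wstring so the result does not depend on the width of wchar_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper, not part of any library: decodes UTF-8 and
// re-encodes it as UTF-16. Throws std::invalid_argument on malformed input.
std::u16string utf8ToUtf16( std::string_view in )
{
    std::u16string out;
    out.reserve( in.size() );

    for ( std::size_t i = 0; i < in.size(); ) {
        const auto lead = static_cast<unsigned char>( in[i] );
        std::uint32_t cp = 0;    // decoded code point
        std::size_t extra = 0;   // number of continuation bytes
        std::uint32_t min = 0;   // smallest code point valid for this length

        if ( lead < 0x80 )                  { cp = lead; }
        else if ( ( lead & 0xE0 ) == 0xC0 ) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ( ( lead & 0xF0 ) == 0xE0 ) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ( ( lead & 0xF8 ) == 0xF0 ) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument( "invalid UTF-8 lead byte" );

        if ( in.size() - i - 1 < extra )
            throw std::invalid_argument( "truncated UTF-8 sequence" );

        for ( std::size_t k = 1; k <= extra; ++k ) {
            const auto c = static_cast<unsigned char>( in[i + k] );
            if ( ( c & 0xC0 ) != 0x80 )
                throw std::invalid_argument( "invalid UTF-8 continuation byte" );
            cp = ( cp << 6 ) | ( c & 0x3F );
        }
        i += extra + 1;

        // Reject overlong forms, encoded surrogates, and out-of-range values.
        if ( cp < min || cp > 0x10FFFF || ( cp >= 0xD800 && cp <= 0xDFFF ) )
            throw std::invalid_argument( "invalid code point in UTF-8 input" );

        if ( cp < 0x10000 ) {
            out.push_back( static_cast<char16_t>( cp ) );
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back( static_cast<char16_t>( 0xD800 + ( cp >> 10 ) ) );
            out.push_back( static_cast<char16_t>( 0xDC00 + ( cp & 0x3FF ) ) );
        }
    }
    return out;
}
```

On a platform where wchar_t is 16 bits wide, such as Windows, the result can be copied into a std::wstring with std::wstring( s.begin(), s.end() ). A well-tested library is still the safer choice for production code.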
