
Convert utf8 wstring to string on windows in C++

I am representing folder paths with boost::filesystem::path, which wraps a std::wstring on Windows, and I would like to convert it to std::string with the following method:

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
shared_dir = conv1.to_bytes(temp.wstring());

but unfortunately the result looks like this:

"c:\\git\\myproject\\bin\\árvíztűrőtükörfúrógép" -> "c:\\git\\myproject\\bin\\árvÃztűrÅ'tükörfúrógép"

What am I doing wrong?

#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // wide character data
    std::wstring wstr =  L"árvíztűrőtükörfúrógép";

    // wide to UTF-8
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string str = conv1.to_bytes(wstr);
}

I was checking the value of the variable in visual studio debug mode.

The code is fine.

You're taking a wstring that stores UTF-16 encoded data, and creating a string that stores UTF-8 encoded data.

I was checking the value of the variable in visual studio debug mode.

Visual Studio's debugger has no idea that your string stores UTF-8. A string just contains bytes. Only you (and people reading your documentation!) know that you put UTF-8 data inside it. You could have put something else inside it.

So, in the absence of anything more sensible to do, the debugger just renders the string as ASCII*. What you're seeing is the ASCII* representation of the bytes in your string.

Nothing is wrong here.

If you were to output the string like std::cout << str , and if you were running the program in a command line window set to UTF-8, you'd get your expected result. Furthermore, if you inspect the individual bytes in your string, you'll see that they are encoded correctly and hold your desired values.

You can push the IDE to decode the string as UTF-8, though, on an as-needed basis: in the Watch window, type str,s8; or, in the Command window, type ? &str[0],s8. These techniques are explored by Giovanni Dicanio in his article "What's Wrong with My UTF-8 Strings in Visual Studio?".


* It's not even really ASCII; it'll be some 8-bit encoding decided by your system, most likely the Windows-1252 code page given the platform. ASCII only defines the lower 7 bits. Historically, the various 8-bit code pages have been colloquially (if incorrectly) called "extended ASCII" in various settings. But the point is that the multi-byte nature of the data is not at all considered by the component rendering the string to your screen, let alone specifically its UTF-8-ness.
