UTF-8 String Iterators

I am trying to write a cross-platform application with Unicode support. I am using the UTF8-C++ library ( http://utfcpp.sourceforge.net/ ), but I am having trouble iterating through a string:

string s1 = "Добрый день";
utf8::iterator<string::iterator> iter(s1.begin(), s1.begin(), s1.end());

for(int i = 0; i < utf8::distance(s1.begin(), s1.end()); i++, ++iter)
{
    cout << (*iter);
}

The above code, when redirected to a UTF-8 formatted text file, produces the following output:

6 3 6 3 6 3 6 3 6 3 6 3 3 2 6 3 6 3 6 3 6 3 

How can I get the content of s1 to appear in the file properly?

You need to ensure that the string is being initialized with the correct data, and then that the iterator is producing the correct values.

You're using VS2010, so there's a bit of a problem with string literals. C++ implementations have an 'execution character set' to which they convert character and string literals from the 'source character set'. Visual Studio does not support UTF-8 as an execution character set, and therefore will not intentionally produce a UTF-8 encoded string literal.

You can get one by tricking the compiler or by using hex escapes. Alternatively, instead of a UTF-8 string literal, you could use a wide string literal containing the correct data and convert it to UTF-8 at runtime.


edit: More recent versions of Visual Studio do now have ways to get UTF-8 string literals. Visual Studio 2015 now supports C++11's UTF-8 string literals. In Visual Studio 2015 Update 2 you can also use the compiler flags /execution-charset:utf-8 or /utf-8.


Tricking the compiler

If you save the source code as 'UTF-8 without signature' then the compiler will assume that the source encoding is the system locale encoding. VS always uses the system locale encoding as the execution encoding. When it believes the source and execution encodings are the same, it performs no conversion, so your source bytes, which are actually UTF-8, are used directly for the string literal. The result is a UTF-8 encoded string literal. (Note that this breaks the conversion done for wide character and string literals.)

Hex escapes

Hex escape codes let you manually insert code units (bytes in this case) of any value into a string literal. You can manually determine the UTF-8 encoding you want and then insert those values into the string literal.

std::string s1 = "\xd0\x94\xd0\xbe\xd0\xb1\xd1\x80\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c";
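If you would rather compute the escape values than look them up by hand, the following is a minimal sketch of the encoding step. The helper names encode_utf8 and to_hex_escapes are hypothetical (they are not part of UTF8-C++), the code assumes a C++11 compiler, and it does no validation of the code point it is given.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render every byte as a \xNN escape, ready to paste
// into a narrow string literal.
std::string to_hex_escapes(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        out += "\\x";
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Calling to_hex_escapes(encode_utf8(0x0414)) gives the text \xd0\x94, which matches the first two escapes in the literal above. This sketch escapes every byte, including ASCII ones; that avoids the trap where a hex escape swallows a following character that happens to be a hex digit.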

UTF-8 string literal prefix

C++11 specifies a prefix that creates a UTF-8 string literal regardless of the execution encoding. Visual Studio 2010 does not implement it (as noted in the edit above, Visual Studio 2015 does). It looks like this:

string s1 = u8"Добрый день";

It requires that the compiler know and use the correct source encoding (and therefore that the source encoding support the desired string). The compiler then converts from the source encoding to UTF-8 instead of to the execution encoding. With a Visual Studio version that supports this feature, you'll probably want to save your source code as 'UTF-8 with signature'. (Again, VS depends on the signature to identify UTF-8 source.)


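Once you have a UTF-8 string, and assuming the UTF-8 iterator works, your example code should produce the correct 11 code points. Dereferencing the iterator yields a numeric code point rather than a character, so each value is printed as a decimal number with no separator, and I think the output text should look like: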

104410861073108810991081321076107710851100

Insert some spaces to make it readable and you can verify that you're getting the right values:

1044 1086 1073 1088 1099 1081 32 1076 1077 1085 1100

Or make it hex and add the Unicode prefix:

U+0414 U+043e U+0431 U+0440 U+044b U+0439 U+0020 U+0434 U+0435 U+043d U+044c
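To check these values without the library, here is a minimal sketch of the decoding step that a UTF-8 iterator performs. It is a hand-rolled stand-in, not the UTF8-C++ implementation; decode_utf8 and join_decimal are hypothetical names, and the decoder assumes well-formed input (it does not reject stray continuation bytes, overlong forms, or truncated sequences).

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: decode a well-formed UTF-8 string into code points.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> cps;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte tells us the sequence length and carries the top bits.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}

// Hypothetical helper: print the code points as space-separated decimals.
std::string join_decimal(const std::vector<std::uint32_t>& cps)
{
    std::ostringstream os;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        if (i != 0) os << ' ';
        os << cps[i];
    }
    return os.str();
}
```

Feeding the hex-escaped literal from earlier through decode_utf8 and then join_decimal should reproduce the spaced list of 11 decimal values shown above.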

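If you actually want to produce a UTF-8 encoded output file then you shouldn't be using the UTF-8 iterator anyway. The iterator decodes the string into code points, whereas the file needs the encoded bytes, so write the string directly: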

string s1 = "Добрый день";
std::cout << s1;

When the output is redirected to a file then the file will contain the UTF-8 encoded data:

Добрый день

I don't understand why your actual output contains extra spaces between the digits, but the values being printed appear to be:

63 63 63 63 63 63 32 63 63 63 63

63 is the ASCII code for '?' and 32 is the ASCII code for a space, which spells out "?????? ????". So you are clearly suffering from VC++'s conversion of the string literal to the system locale encoding.
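One way to confirm this diagnosis is to look at the raw bytes the compiler actually stored in the literal. The sketch below is an assumption about how you might do that; dump_bytes is a hypothetical helper, not a library function.

```cpp
#include <string>

// Hypothetical helper: return the bytes of s as space-separated hex pairs.
std::string dump_bytes(const std::string& s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

If dump_bytes(s1) shows a run of 3f values with a single 20 in the middle, the literal was replaced by question marks at compile time. A correctly encoded UTF-8 literal would instead start with d0 94 d0 be.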

Answer updated. Use wstring (the best option given VS2010, I think) to store a UTF-16 string, convert it to UTF-8, and output the result.

This works for me when I view the output in a UTF-8 compatible editor (SciTE).

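    // wchar_t is 16 bits on Windows, so this wide literal holds UTF-16 code units.
    std::wstring s1 = L"Добрый день";
    std::vector<unsigned char> UTF8;

    // Convert UTF-16 to UTF-8 bytes with UTF8-C++.
    utf8::utf16to8( s1.begin(), s1.end(), std::back_inserter( UTF8 ) );

    // Write the bytes out unchanged; unsigned char streams as a character, not a number.
    for( auto It = UTF8.begin() ; It != UTF8.end() ; ++It )
    {
        std::cout << (*It);
    }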

I don't think there's a way in VS2010 to have a UTF-8 literal or string object. UTF-16 (wstring) is, I think, your best bet internally; then use the UTF-8 library to convert to and from UTF-8 when exchanging data with files, the network, and so on.
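For illustration only, here is a rough sketch of what a UTF-16 to UTF-8 conversion involves, including surrogate pairs. It is not the UTF8-C++ implementation; utf16_to_utf8 is a hypothetical name, and the code assumes a C++11 compiler with std::u16string and well-formed input (an unpaired surrogate is passed through without an error).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: convert UTF-16 code units to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // A high surrogate followed by a low surrogate encodes one code point
        // above U+FFFF; combine the pair and skip the second unit.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Emit the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

In real code the library call is the better choice, since it validates its input. The sketch is only meant to show why text kept as UTF-16 internally can be exported as UTF-8 at the boundary without loss.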
