Unicode and std::string in C++

If I write a random string to file in C++ consisting of some unicode characters, I am told by my text editor that I have not created a valid UTF-8 file.

// Code example
const std::string charset = "abcdefgàèíüŷÀ";
file << random_string(charset); // using std::fstream

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

random_string is likely to be the culprit; I wonder how it's implemented. If your string is indeed UTF-8-encoded and random_string looks like

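#include <cstdlib>
#include <string>

// Hypothetical implementation: it picks random bytes, not random characters.
std::string random_string(std::string const &charset)
{
    const int N = 10;
    std::string result(N, '\0');  // std::string has no size-only constructor
    for (int i = 0; i < N; i++)
        result[i] = charset[rand() % charset.size()];  // one byte, possibly mid-character
    return result;
}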

then it will take random chars from charset, which in UTF-8 (as other posters have pointed out) are not Unicode code points but single bytes. If it picks a byte from the middle of a multibyte UTF-8 sequence and places it at the start of the string or directly after a 7-bit ASCII character, or picks a lead byte without its continuation bytes, then your output will not be valid UTF-8. See Wikipedia and RFC 3629.
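To make that failure mode concrete, here is a minimal sketch of a UTF-8 well-formedness check along the lines of RFC 3629. The name is_valid_utf8 is made up for this illustration and is not a standard library function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }         // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }  // 11110xxx
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > s.size()) return false;  // sequence cut off at the end

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // expected 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

With this check, is_valid_utf8("\xC3\x80") (the two bytes of "À") returns true, while is_valid_utf8("\x80") and is_valid_utf8("\xC3") both return false. Those last two are exactly the kind of fragments a byte-wise random pick can produce.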

The solution might be to convert to and from UTF-32 in random_string. I believe wchar_t and std::wstring use UTF-32 on Linux (on Windows, wchar_t is 16 bits wide and holds UTF-16). UTF-16 would also be safe, as long as you stay within the Basic Multilingual Plane.
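A variation on that idea, sketched below, skips the UTF-32 round trip and instead splits the charset into one small string per code point, so that a random pick always copies a whole character. The names split_code_points and random_string_utf8 are invented for this example, and the code assumes the charset is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into one std::string per code point.
// Assumes valid UTF-8 input; the sequence length is read from the lead byte.
std::vector<std::string> split_code_points(const std::string& utf8)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;               // 0xxxxxxx: ASCII
        if      (lead >= 0xF0) len = 4;    // 11110xxx
        else if (lead >= 0xE0) len = 3;    // 1110xxxx
        else if (lead >= 0xC0) len = 2;    // 110xxxxx
        out.push_back(utf8.substr(i, len));
        i += len;
    }
    return out;
}

// Build a string of n randomly chosen characters (not bytes) from charset.
std::string random_string_utf8(const std::string& charset, std::size_t n,
                               std::mt19937& rng)
{
    const std::vector<std::string> symbols = split_code_points(charset);
    if (symbols.empty()) return std::string();

    std::uniform_int_distribution<std::size_t> pick(0, symbols.size() - 1);
    std::string result;
    for (std::size_t i = 0; i < n; ++i)
        result += symbols[pick(rng)];      // appends a complete UTF-8 sequence
    return result;
}
```

The result is still a plain std::string of UTF-8 bytes, so it can be written to a std::fstream unchanged. Note that this works per code point; characters built from combining sequences would need extra handling.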

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

You are correct that std::string is encoding agnostic. It simply holds an array of char elements. How these char elements are interpreted as text depends on the environment. If your locale is not set to some form of Unicode (i.e. UTF-8 or UTF-16), then when you output a string it will not be displayed or interpreted as Unicode.

Are you sure your string literal "abcdefgàèíüŷÀ" is actually Unicode and not, for example, Latin-1 (ISO-8859-1, or possibly Windows-1252)? You need to determine which locale your platform is currently configured to use.

-----------EDIT-----------

I think I know your problem: some of those Unicode characters in your charset string literal, like the accented character "À", are two-byte characters (assuming a UTF-8 encoding). When you address the character-set string using the [] operator in your random_string function, you are returning half of a Unicode character. Thus the random-string function creates an invalid character string.

For example, consider the following code:

std::string s = "À";
std::cout << s.length() << std::endl;

In an environment where the string literal is interpreted as UTF-8, this program will output 2. Therefore, the first char of the string (s[0]) is only half of a Unicode character, and is not valid on its own. Since your random_string function addresses the string by single bytes using the [] operator, you're creating invalid random strings.
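The gap between bytes and characters can be checked directly. The sketch below counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx); count_code_points is a hypothetical helper name, and the code assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a valid UTF-8 string.
// Every byte that is not a continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For std::string s = "\xC3\x80" (the UTF-8 bytes of "À", written as escapes so the result does not depend on the source file's encoding), s.length() is 2 while count_code_points(s) is 1.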

So yes, you need to use std::wstring, and create your charset string literal using the L prefix.

In your code sample, the std::string charset stores exactly what you wrote. That is, if you used a UTF-8 text editor to write the source, what ends up in the output file is that same UTF-8 text.

UTF-8 is just an encoding scheme in which different characters take different numbers of bytes. If you use a UTF-8 editor, it will encode, say, 'ñ' as two bytes, and when you write it to a file, the file will contain those same two bytes (and so still be valid UTF-8).

The problem may be the editor you used to create the C++ source file. It may use Latin-1 or some other encoding.

To write UTF-8, you need to use a codecvt facet like this one. An example of how to use it can be seen here.
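The facet linked above is not reproduced here. As an alternative that avoids locale facets altogether, the following is a rough sketch of a hand-written encoder that turns UTF-32 code points into UTF-8 bytes, which can then be written through an ordinary std::ofstream. The names append_utf8 and to_utf8 are invented for this example, and the code assumes each input value is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to out.
// Assumes cp is a valid Unicode scalar value; no validation is performed.
void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole UTF-32 string as UTF-8.
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        append_utf8(out, cp);
    return out;
}
```

For example, to_utf8(U"\u00C0") yields the two bytes 0xC3 0x80, the same bytes a UTF-8 editor stores for "À". Working in std::u32string and encoding only at the output boundary keeps random access by character safe inside the program.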
