
Unicode and std::string in C++

If I write a random string to a file in C++ consisting of some Unicode characters, my text editor tells me that I have not created a valid UTF-8 file.

// Code example
const std::string charset = "abcdefgàèíüŷÀ";
file << random_string(charset); // using std::fstream

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a Unicode string and write it to a file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

random_string is likely to be the culprit; I wonder how it's implemented. If your string is indeed UTF-8-encoded and random_string looks like

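#include <cstdlib>
#include <string>

std::string random_string(std::string const &charset)
{
    const int N = 10;
    std::string result(N, '\0');  // N placeholder bytes, overwritten below
    for (int i = 0; i < N; i++)
        result[i] = charset[std::rand() % charset.size()];  // copies one byte, not one character
    return result;
}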

then it will take random chars from charset, which in UTF-8 (as other posters have pointed out) are not Unicode code points but single bytes. If it selects a byte from the middle of a UTF-8 multibyte character as the first byte (or puts it after a 7-bit ASCII-compatible character), then your output will not be valid UTF-8. See Wikipedia and RFC 3629.

The solution might be to transform to and from UTF-32 in random_string. I believe wchar_t and std::wstring use UTF-32 on Linux. UTF-16 would also be safe, as long as you stay within the Basic Multilingual Plane.
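A minimal sketch of that round trip, assuming the charset is well-formed UTF-8. The helper names decode_utf8 and append_utf8 are made up for illustration, the length parameter n is an addition, and std::mt19937 is swapped in for rand(); malformed input is not detected.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Assumption: `in` is well-formed UTF-8; errors are not reported.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes announced by the lead byte.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        ++i;
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Append one code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Pick n random code points (not bytes) from a UTF-8 charset.
std::string random_string(const std::string& charset, std::size_t n = 10) {
    const std::u32string points = decode_utf8(charset);
    assert(!points.empty());
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, points.size() - 1);
    std::string result;
    for (std::size_t i = 0; i < n; ++i)
        append_utf8(result, points[pick(rng)]);
    return result;
}
```

Because whole code points are selected and then re-encoded, the result is valid UTF-8 whenever the charset is, and it can be written to the std::fstream unchanged.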

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a Unicode string and write it to a file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

You are correct that std::string is encoding-agnostic. It simply holds an array of char elements. How these char elements are interpreted as text depends on the environment. If your locale is not set to some form of Unicode (e.g. UTF-8 or UTF-16), then when you output a string it will not be displayed or interpreted as Unicode.

Are you sure your string literal "abcdefgàèíüŷÀ" is actually Unicode and not, for example, Latin-1 (ISO-8859-1, or possibly Windows-1252)? You need to determine what locale your platform is currently configured to use.
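One way to check which bytes actually ended up in the string is to test whether they form valid UTF-8. The following is a sketch of such a check, written against the rules in RFC 3629; is_valid_utf8 and check_examples are hypothetical names, and the example literals use explicit byte escapes so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: correct lead/continuation
// structure, no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that may legally use a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                      // stray continuation or invalid lead byte
        if (i + len > s.size()) return false;   // sequence cut off at end of string
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;                // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode range
        i += len;
    }
    return true;
}

// Self-check: "À" as UTF-8 is C3 80; as Latin-1 it is the single byte C0.
void check_examples() {
    assert(is_valid_utf8("abc\xC3\x80"));   // UTF-8 encoded
    assert(!is_valid_utf8("abc\xC0"));      // Latin-1 encoded
}
```

If the charset literal fails a check like this, the source file was probably not saved as UTF-8; if the literal passes but the output of random_string fails, the damage happens inside random_string.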

-----------EDIT-----------

I think I know your problem: some of the Unicode characters in your charset string literal, like the accented character "À", are two-byte characters (assuming a UTF-8 encoding). When you index the charset string using the [] operator in your random_string function, you are returning half of a Unicode character. Thus the random_string function creates an invalid character string.

For example, consider the following code:

std::string s = "À";
std::cout << s.length() << std::endl;

In an environment where the string literal is interpreted as UTF-8, this program will output 2. Therefore, the first char of the string (s[0]) is only half of a Unicode character, and is not valid on its own. Since your random_string function addresses the string by single bytes using the [] operator, you're creating invalid random strings.
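The difference between bytes and characters can be made explicit with a small sketch. The function names here are made up for illustration, and the literal is spelled with byte escapes (C3 80 is the UTF-8 encoding of "À") so it behaves the same regardless of how the source file is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which only ever appear
// inside a multibyte UTF-8 sequence, never at its start.
bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Count code points in well-formed UTF-8 by counting the bytes
// that start a sequence (everything except continuation bytes).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8)
        if (!is_continuation_byte(c)) ++n;
    return n;
}

// Self-check of the claims in the text above.
void check_a_grave() {
    const std::string s = "\xC3\x80";            // "À" in UTF-8
    assert(s.length() == 2);                     // two bytes...
    assert(count_code_points(s) == 1);           // ...but one character
    assert(!is_continuation_byte(s[0]));         // lead byte
    assert(is_continuation_byte(s[1]));          // second half; invalid on its own
}
```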

So yes, you need to use std::wstring, and create your charset string literal using the L prefix.

In your code sample, the std::string charset stores what you write. That is, if you used a UTF-8 text editor to write it, what you get in the output file is exactly that UTF-8 text.

UTF-8 is just an encoding scheme in which different characters use different numbers of bytes. If you use a UTF-8 editor, it will encode, say, 'ñ' with two bytes, and when you write it to a file, the file will contain those same two bytes (and so still be UTF-8 compliant).

The problem may be the editor you used to create the C++ source file. It may use Latin-1 or some other encoding.

To write UTF-8, you need to use a codecvt facet like this one. An example of how to use it can be seen here.
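Since the links above did not survive, here is a rough sketch of what using such a facet can look like. It assumes std::codecvt_utf8 from the <codecvt> header, which is deprecated since C++17 but still shipped by the common standard libraries, so expect deprecation warnings. The function names and the file path passed in are placeholders, not part of the original answer.

```cpp
#include <cassert>
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Convert a wide string to UTF-8 bytes in memory.
std::string to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}

// Write a wide string to `path`, letting the stream's codecvt facet
// turn each wchar_t into UTF-8 on the way out.
void write_utf8_file(const char* path, const std::wstring& text) {
    std::wofstream file;
    // Imbue before opening; the locale takes ownership of the facet.
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));
    file.open(path, std::ios::binary);
    file << text;
}

// Self-check: U+00C0 ("À") must come out as the two bytes C3 80.
void check_conversion() {
    assert(to_utf8(L"\u00C0") == "\xC3\x80");
}
```

With this in place the charset can be a std::wstring built from an L"..." literal, random_string can index it by wchar_t, and the conversion to UTF-8 happens only at the point of output. On platforms where wchar_t is 16 bits wide, this only covers the Basic Multilingual Plane.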
