Unicode and std::string in C++
If I write a random string consisting of some Unicode characters to a file in C++, my text editor tells me that I have not created a valid UTF-8 file.
// Code example
const std::string charset = "abcdefgàèíüŷÀ";
file << random_string(charset); // using std::fstream
What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a Unicode string and write it to a file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?
random_string is likely to be the culprit; I wonder how it's implemented. If your string is indeed UTF-8-encoded and random_string looks like
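#include <cstdlib>  // rand
#include <string>

std::string random_string(std::string const &charset)
{
    const int N = 10;
    std::string result(N, '\0');  // std::string has no size-only constructor
    for (int i = 0; i < N; i++)
        result[i] = charset[rand() % charset.size()];  // picks a single byte, not a whole character
    return result;
}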
then it will take random chars from charset, which in UTF-8 (as other posters have pointed out) are not Unicode code points but single bytes. If it selects a byte from the middle of a UTF-8 multibyte character as the first byte (or puts one after a 7-bit ASCII-compatible character), then your output will not be valid UTF-8. The same happens if it picks a lead byte without the continuation bytes that must follow it. See Wikipedia and RFC 3629.
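To check whether a given output really is broken in this way, a structural validity test along the lines of RFC 3629 can help. The following is only a minimal sketch; the function name is_valid_utf8 is made up for illustration and does not come from any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is structurally valid UTF-8
// (correct lead/continuation pattern, no overlong forms, no surrogates,
// nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > s.size()) return false;  // sequence cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // expected a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

Calling it on the output of the byte-picking random_string would, most of the time, return false whenever a non-ASCII byte was picked, since a lone lead byte or a lone continuation byte fails the checks above.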
The solution might be to transform to and from UTF-32 in random_string. I believe wchar_t and std::wstring use UTF-32 on Linux. UTF-16 would also be safe, as long as you stay within the Basic Multilingual Plane.
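One way the UTF-32 round trip could look is sketched below. It is not the asker's code: the helpers utf8_to_utf32 and append_utf8 are hypothetical names, the decoder assumes the charset is already valid UTF-8, the length parameter n is an addition, and std::mt19937 is swapped in for rand().

```cpp
#include <cstddef>
#include <random>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32 code points.
// Assumes `in` is valid UTF-8; no error handling in this sketch.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = b;
        if      ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-8 and append it to `out`.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Picks n whole code points from a UTF-8 charset and returns them as UTF-8.
std::string random_string(const std::string& charset, std::size_t n = 10) {
    const std::u32string points = utf8_to_utf32(charset);
    std::string result;
    if (points.empty()) return result;
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, points.size() - 1);
    for (std::size_t i = 0; i < n; ++i)
        append_utf8(result, points[pick(gen)]);
    return result;
}
```

Because every element of the intermediate std::u32string is a complete code point, indexing it can no longer cut a character in half, and the re-encoded result stays valid UTF-8 as long as the charset was.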
"What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a Unicode string and write it to a file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?"
You are correct that std::string is encoding-agnostic. It simply holds an array of char elements. How these char elements are interpreted as text depends on the environment. If your locale is not set to some form of Unicode (e.g. UTF-8 or UTF-16), then when you output a string it will not be displayed/interpreted as Unicode.
Are you sure your string literal "abcdefgàèíüŷÀ" is actually Unicode and not, for example, Latin-1 (ISO-8859-1 or possibly Windows-1252)? You need to determine what locale your platform is currently configured to use.
-----------EDIT-----------
I think I know your problem: some of those Unicode characters in your charset string literal, like the accented character "À", are two-byte characters (assuming a UTF-8 encoding). When you address the character-set string using the [] operator in your random_string function, you are returning half of a Unicode character. Thus the random_string function creates an invalid character string.
For example, consider the following code:
std::string s = "À";
std::cout << s.length() << std::endl;
In an environment where the string literal is interpreted as UTF-8, this program will output 2. Therefore, the first char of the string (s[0]) is only the first of the two bytes that encode "À", and is not valid on its own. Since your random_string function addresses the string by single bytes using the [] operator, you're creating invalid random strings.
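To see the individual bytes rather than just their count, the string can be dumped in hex. This is a small sketch; hex_bytes is a hypothetical helper name, not a standard function.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format each byte of `s` as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For the UTF-8 encoding of "À", written explicitly as "\xC3\x80", hex_bytes returns "c3 80": a lead byte followed by one continuation byte. Picking either byte alone, as s[0] or s[1] does, gives a sequence that is not valid UTF-8.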
So yes, you need to use std::wstring, and create your charset string literal using the L prefix.
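The difference between the two string types can be sketched as follows. The function names narrow_units and wide_units are invented for this illustration, and the character is written with explicit escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// U+00C0 ("À") as explicit UTF-8 bytes in a narrow string: two char elements.
std::size_t narrow_units() {
    return std::string("\xC3\x80").length();
}

// The same character in a wide string literal: one wchar_t element,
// so operator[] yields the whole character.
std::size_t wide_units() {
    return std::wstring(L"\u00C0").length();
}
```

Here narrow_units() gives 2 and wide_units() gives 1. Two caveats apply: where wchar_t is 16 bits wide (UTF-16, as on Windows), characters outside the Basic Multilingual Plane still occupy two elements, and a std::wstring has to be converted back to UTF-8 before it is written to a file that should be UTF-8.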
In your code sample, std::string charset stores exactly what you write. That is, if you used a UTF-8 text editor to write it, what ends up in the output file is exactly that UTF-8 text.
UTF-8 is just an encoding scheme in which different characters take different numbers of bytes. If you use a UTF-8 editor, it will encode, say, 'ñ' with two bytes, and when you write it to a file, the file will contain those two bytes (again UTF-8 compliant).
The problem may be the editor you used to create the C++ source file. It may use Latin-1 or some other encoding.