How to get the accurate length of a std::string?
I am trimming a long std::string to fit it in a text container using this code:
std::string AppDelegate::getTrimmedStringWithRange(std::string text, int range)
{
if (text.length() > range)
{
std::string str(text,0,range-3);
return str.append("...");
}
return text;
}
But for text in other languages, such as Hindi ("हिन्दी"), the length of the std::string is wrong.
My question is: how can I retrieve the accurate length of the std::string in all test cases?
Thanks
The length of std::string is not "wrong"; you've simply misunderstood what it means. A std::string stores bytes, not "characters" in your chosen encoding. It gleefully has no knowledge of that layer. As such, the length of std::string is the number of bytes it contains.
To count such "characters", you will need a library that supports analysis of your chosen encoding, whatever that is.
Only if your chosen encoding is a single-byte one, such as plain ASCII, can you just count the bytes and be done with it.
Assuming you're using UTF-8, you can convert your string to a simple (hah!) fixed-width Unicode form such as UTF-32 and count the code points. I grabbed this example from rosettacode.
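#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <iostream>
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

int main()
{
    // U+007A, U+00DF, U+6C34, U+1D10B encoded as UTF-8
    std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
    std::cout << "Byte length: " << utf8.size() << '\n';
    // Convert to UTF-32 so that each element is exactly one code point
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';
}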
As explained in the comments, length() returns the number of bytes in your string, which is encoded in UTF-8. In this multibyte encoding, non-ASCII characters are encoded in 2 to 4 bytes, so the length of your UTF-8 string will be larger than the actual number of Unicode characters.
Solution 1
If you have many long strings, you can keep them in UTF-8. The UTF-8 encoding makes it relatively easy to find the additional bytes of multibyte characters: they all start with 10xxxxxx in binary. So count the number of such additional bytes and subtract it from the string length:
cout << "Bytes: " << s.length() << endl;
cout << "Unicode length " << (s.length() - count_if(s.begin(), s.end(), [](char c)->bool { return (c & 0xC0) == 0x80; })) << endl;
Solution 2
If more processing is needed than just counting the length, you could think of using wstring_convert::from_bytes() in the standard library to convert your string into a wstring. The length of the wstring should be what you expect.
wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
wstring w = cv.from_bytes(s);
cout << "Unicode length " << w.length() << endl;
Attention: on Linux, wstring is based on a 32-bit wchar_t, and one such wide character can hold any Unicode code point, so this works perfectly. On Windows, however, wchar_t is only 16 bits, so some characters might still require multi-word encoding (surrogate pairs). Fortunately, all the Hindi characters are in the range U+0000 to U+D7FF, which can be encoded in one 16-bit word, so it should be OK there as well.