
How to measure the correct size of non-ASCII characters?

In the following program, I'm trying to measure the length of a string that contains non-ASCII characters.

However, I'm not sure why size() doesn't print the correct length when the string contains non-ASCII characters.

#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo on Wandbox.

std::string::size returns the length in bytes (char elements), not the number of characters. Your second string is stored in a Unicode encoding (UTF-8 here, judging by the output), so a single character may take several bytes: each of its 6 code points occupies 3 bytes, which gives 18. Note that the same caveat applies to std::wstring::size, since it returns the number of wide characters (code units), not actual characters. Whether the two match depends on the encoding: with UTF-32 every code unit is one code point, but with UTF-16 a code point outside the Basic Multilingual Plane takes two code units (a surrogate pair). More in this answer.
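To illustrate the point about code units versus code points in UTF-16, here is a minimal sketch. The helper name utf16_length is made up for this example, and it assumes the input is well-formed UTF-16 (no unpaired surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string by skipping
// low (trailing) surrogates, which occupy the range 0xDC00..0xDFFF.
// Assumes well-formed UTF-16; an unpaired surrogate would be miscounted.
std::size_t utf16_length(const std::u16string& s) {
    std::size_t len = 0;
    for (char16_t c : s)
        len += (c & 0xFC00) != 0xDC00; // true for everything except a low surrogate
    return len;
}
```

For U+1F600, which lies outside the Basic Multilingual Plane, std::u16string(u"\U0001F600").size() is 2, while utf16_length(u"\U0001F600") is 1 and std::u32string(U"\U0001F600").size() is also 1.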

To measure the actual length (in number of symbols) you need to know the encoding in order to separate (and therefore count) the characters correctly. Keep in mind that a code point is not always a user-perceived character: combining marks, such as the Devanagari vowel signs in इंडिया, are separate code points. This answer may be helpful for UTF-8, for example (although the method it uses is deprecated in C++17).

Another option for UTF-8 is to count the number of first bytes, i.e. every byte that is not a continuation byte of the form 10xxxxxx (credit to this other answer):

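#include <string>

// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes s holds well-formed UTF-8.
int utf8_length(const std::string& s) {
  int len = 0;
  for (unsigned char c : s)
      len += (c & 0xc0) != 0x80;
  return len;
}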

I used the std::wstring_convert class and got the correct length of the string.

#include <string>
#include <iostream>
#include <codecvt>

int main()
{
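    std::string s2 = "इंडिया"; // non-ASCII string, stored as UTF-8
    // Convert the UTF-8 bytes to UTF-32 so that each element is one code point.
    // Note: std::wstring_convert and std::codecvt_utf8 are deprecated since C++17.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cn;
    auto sz = cn.from_bytes(s2).size(); // number of code points
    std::cout << "Size of " << s2 << " is " << sz << std::endl;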
}

Live demo on Wandbox.

See the reference link here for more about std::wstring_convert.
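Since std::wstring_convert is deprecated as of C++17, one possible alternative is to decode the UTF-8 bytes by hand. The following is only a sketch: the function name utf8_to_utf32 is made up for this example, and it replaces malformed or truncated sequences with U+FFFD instead of throwing. It does not reject overlong encodings, encoded surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// A byte that cannot start or complete a sequence becomes U+FFFD.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t n = 0; // sequence length in bytes, 0 means invalid lead byte
        char32_t cp = 0;   // code point being assembled
        if (lead < 0x80)                { n = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { n = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; }

        // The sequence must fit in the remaining input.
        bool ok = n != 0 && n <= s.size() - i;
        for (std::size_t k = 1; ok && k < n; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            ok = (c & 0xC0) == 0x80;     // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F); // append its 6 payload bits
        }
        if (ok) { out.push_back(cp); i += n; }
        else    { out.push_back(0xFFFD); i += 1; } // resynchronize at the next byte
    }
    return out;
}
```

With this helper, utf8_to_utf32(s2).size() gives the number of code points in s2, in the same way as cn.from_bytes(s2).size() above.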
