C++ UTF-8 actual string length

Is there any native (cross-platform) C++ function in the standard library that returns the actual length of a std::string?

Update: as we know, std::string::length() returns the number of bytes, not the number of characters. I already have a custom function that returns the actual length, but I'm looking for a standard one.

codecvt ought to be helpful: the Standard provides implementations for UTF-8, and std::codecvt_utf8<char32_t> would be appropriate in this case. Note that <codecvt> and std::wstring_convert are deprecated as of C++17.

Probably something like:

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>().from_bytes(the_std_string).size()
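For completeness, here is a minimal sketch of that one-liner wrapped in a function, with the headers it needs. The name utf8_code_points is made up for illustration. It assumes the input is valid UTF-8 (from_bytes throws std::range_error otherwise), and that your toolchain still ships the deprecated <codecvt> header, possibly with a deprecation warning.

```cpp
#include <codecvt>   // deprecated since C++17
#include <cstddef>
#include <locale>    // std::wstring_convert
#include <string>

// Hypothetical helper: number of Unicode code points in a UTF-8 string.
// Throws std::range_error if the bytes are not valid UTF-8.
std::size_t utf8_code_points(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    // from_bytes yields a std::u32string with one element per code point.
    return converter.from_bytes(utf8).size();
}
```

Keep in mind this counts code points, which is not necessarily what a user would call characters.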

The actual length is the number of bytes. Counting code points is rarely meaningful. You may, however, want to count other things, such as grapheme clusters.

See more about the different kinds of string length at http://utf8everywhere.org

There is no way to do that in standard C or C++ without third-party libraries. Even if you convert to char32_t, you will get code points, not characters.

A code point does not match what a user perceives as a character, because of things like decomposed forms, ligatures, and variation selectors.
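To make the difference concrete, here is a small sketch (not a standard facility; count_code_points is a hypothetical helper) that counts code points by skipping UTF-8 continuation bytes. It assumes well-formed UTF-8 and does no validation. The two sample strings both render as "é", yet they differ in both byte count and code point count.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is NOT a continuation byte (continuation bytes are 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Both strings are displayed as a single "é".
const std::string precomposed = "\xC3\xA9";   // U+00E9
const std::string decomposed  = "e\xCC\x81";  // U+0065 followed by U+0301

// precomposed: size() == 2 bytes, count_code_points == 1
// decomposed:  size() == 3 bytes, count_code_points == 2
```

Neither number tells you that the user sees one character in both cases; that requires grapheme cluster segmentation.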

The closest available construct to a "user character" is a "grapheme cluster" (see http://www.unicode.org/reports/tr29/ )

Your best cross-platform option is ICU4C ( http://site.icu-project.org/ )
