Using the utfcpp library, one can split a UTF-8 encoded string ('哈哈哈') into several uint32_t code points (21704, 21704, 21704), which play the same role that chars play in a std::string.
In this situation, what is the best way to store the uint32_t ('character') sequence as a 'string'?
For example, putting (21704, 21704, 21704) into a vector<uint32_t> would require iterating over the vector for 'string comparison', which seems slower than what a real std::string does.
Thanks in advance.
Either use std::wstring or your own home-brewed std::basic_string<uint32_t>.
This would let you use their operators and functions to manipulate such objects.
Modern versions of C++ come with char16_t and char32_t. They should be preferred to the uintxx_t types because clause 24.2 Character traits [char.traits] mandates the definition of char_traits specializations for them:
This subclause defines requirements on classes representing character traits, and defines a class template char_traits<charT>, along with four specializations, char_traits<char>, char_traits<char16_t>, char_traits<char32_t>, and char_traits<wchar_t>, that satisfy those requirements.
This even allows immediate access to a basic_string specialization: 24.3 String classes [string.classes] says:
The header <string> defines the basic_string class template for manipulating varying-length sequences of char-like objects and four typedef-names, string, u16string, u32string, and wstring, that name the specializations basic_string<char>, basic_string<char16_t>, basic_string<char32_t>, and basic_string<wchar_t>, respectively.
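As a minimal sketch of what this buys you, the code points from the question can be copied into a std::u32string and then compared with the ordinary string operators. The helper names to_u32string and starts_with are made up for this illustration; they are not part of utfcpp or the standard library.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: copy already-decoded code points into a std::u32string.
std::u32string to_u32string(const std::vector<std::uint32_t>& code_points) {
    std::u32string out;
    out.reserve(code_points.size());
    for (std::uint32_t cp : code_points) {
        out.push_back(static_cast<char32_t>(cp));
    }
    return out;
}

// Hypothetical helper: true when `text` begins with `prefix`.
// It relies only on basic_string::compare, which u32string inherits
// from the basic_string template like any other specialization.
bool starts_with(const std::u32string& text, const std::u32string& prefix) {
    return text.compare(0, prefix.size(), prefix) == 0;
}
```

With these, to_u32string applied to (21704, 21704, 21704) should compare equal to the literal U"\u54C8\u54C8\u54C8", and ==, <, find, substr and the rest of the basic_string interface are available without writing any traits class.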
Unfortunately, when it comes to direct I/O, no stream specializations exist out of the box for char32_t, but UTF-8 locales should have converters between char32_t and char.
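If you would rather not depend on locale facilities, one alternative is to encode the std::u32string back to UTF-8 by hand and write the resulting std::string to an ordinary char stream. The function below is a hand-rolled sketch, not the locale converter mentioned above; the name to_utf8 is made up for this illustration, and it assumes the input holds valid Unicode scalar values.

```cpp
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
// Sketch only: surrogates and values above U+10FFFF are NOT rejected.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The result can be passed to std::cout as usual; whether it displays correctly depends on the terminal expecting UTF-8.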