C++11 introduced the regex library into the standard library.
On the Windows/MSVC platform, wchar_t is 2 bytes (16 bits) wide, and wchar_t* strings are normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).
However, it seems that std::regex is not UTF-8 aware or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2.
I cannot find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.
The question is: does std::wregex represent UCS-2 when wchar_t is 2 bytes wide?
The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply sequences of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.
Similarly, std::regex and std::wregex are just std::basic_regex over a CharT (char and wchar_t respectively). Their constructors accept a std::basic_string, and whatever encoding that std::basic_string uses is also the encoding the std::basic_regex works with. So the statement "std::regex isn't utf-8 or does not support it" is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).
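To make this concrete, here is a minimal sketch of std::regex applied to UTF-8 bytes. It assumes the default ECMAScript grammar and input that is already valid UTF-8; the helper names contains and full_match are made up for this illustration. The engine compares char elements one by one, so a literal multi-byte sequence is found as-is, while . consumes a single code unit rather than a whole character.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
inline bool contains(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());  // an empty pattern would match trivially
    return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper: true if `pattern` matches the whole of `text`.
inline bool full_match(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    return std::regex_match(text, std::regex(pattern));
}

// U+20AC EURO SIGN written as explicit UTF-8 bytes, so the example
// does not depend on the encoding of the source file.
const std::string euro = "\xE2\x82\xAC";
```

With these helpers, contains("price: 5" + euro, euro) should succeed, while full_match(euro, ".") should fail and full_match(euro, "...") should succeed, because the euro sign is three char elements long.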
On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing: a well-formed needle can never match starting in the middle of a character in the haystack. The same holds for UTF-8. If a tool doesn't understand UTF-16, it very likely doesn't know that UTF-8 is variable-length either, and will truncate UTF-8 text in the middle of a character too.
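As a small illustration of searching by code units, the sketch below uses std::u16string (whose encoding is defined) rather than std::wstring, so it behaves the same on every platform. The helper find_units and the sample string are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: offset (in 16-bit code units) of `needle` inside
// `haystack`, or std::u16string::npos if it does not occur.
inline std::size_t find_units(const std::u16string& haystack,
                              const std::u16string& needle) {
    assert(!needle.empty());
    return haystack.find(needle);
}

// 'a', then U+1F600 (stored as the surrogate pair D83D DE00), then 'b'.
const std::u16string sample = u"a\U0001F600b";
```

The search is a plain comparison of 16-bit units. The non-BMP character occupies two units, so it is found at offset 1 and the following 'b' sits at offset 3; a BMP needle can never match inside the pair because surrogate values are reserved for that purpose.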
Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
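The "back up to the leading byte" step described above can be sketched as follows. This is only an illustration that assumes the input is valid UTF-8; the function names is_continuation_byte and code_point_start are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: index of the first byte of the character that
// contains byte `pos`. For valid UTF-8 the loop steps back at most 3 bytes.
inline std::size_t code_point_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    while (pos > 0 && is_continuation_byte(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

A tool that needs to cut a UTF-8 string near some byte offset can call such a function first and cut at the returned index, which avoids truncating in the middle of a character.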
The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never put characters outside the BMP in a character class, as noted in the comments. Use a group (alternation) for them instead.
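The difference between a character class and a group can be sketched with std::wregex. To keep the example portable, the surrogate values are written with \x escapes so that each one occupies its own wchar_t even where wchar_t is 32 bits wide; on MSVC this is exactly what a UTF-16 std::wstring contains. The function names below are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// U+1F600 as its two UTF-16 code units: high surrogate D83D, low surrogate DE00.
const std::wstring emoji = L"\xD83D\xDE00";
const std::wstring lone_high_surrogate = L"\xD83D";

// Character class: the engine sees three independent elements
// (D83D, DE00 and 'a') and matches exactly one of them.
inline std::wregex make_class_pattern() {
    return std::wregex(L"[" + emoji + L"a]");
}

// Group with alternation: the surrogate pair is matched as one
// two-unit sequence, or 'a' as a single unit.
inline std::wregex make_group_pattern() {
    return std::wregex(L"(?:" + emoji + L"|a)");
}

// Hypothetical helper: true if `re` matches the whole of `text`.
inline bool whole(const std::wstring& text, const std::wregex& re) {
    assert(!text.empty());
    return std::regex_match(text, re);
}
```

With the class pattern a lone high surrogate is accepted and the full pair is not, since the class consumes only one code unit. The group pattern should behave the other way round, which is the intended result.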
"In other languages normalization takes place."
This is wrong. Some languages may normalize before matching a regex, but that definitely doesn't apply to all "other languages".
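If you want a little more assurance, you can try std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note, however, that the standard only requires std::regex_traits specializations for char and wchar_t, so these instantiations may not work out of the box on a given implementation. You'll still need a UTF-16-aware library anyway, otherwise matching will only work reliably for patterns that contain nothing but literal text.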
A better solution may be switching to another library such as ICU regex. You can check the Comparison of regular expression engines page for some suggestions. It even has a column indicating native UTF-16 support for each library.