
Does `std::wregex` support utf-16/unicode or only UCS-2?

With C++11, the regex library was introduced into the standard library.

On the Windows/MSVC platform, wchar_t has a size of 2 (16 bits), and wchar_t* strings are normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).

However, it seems that std::regex isn't UTF-8 or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2?

I could not find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.

The question is:

Does std::wregex handle UTF-16, or only UCS-2, when wchar_t has a size of 2?

The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply a sequence of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.

Similarly, std::regex and std::wregex are just std::basic_regex wrapped around a CharT. Their constructors accept a std::basic_string, and whatever encoding that std::basic_string uses is also what std::basic_regex works with. So what you said, "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).
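As a rough illustration of what "the regex uses the string's encoding" means in practice, here is a minimal sketch. It assumes the engine compares individual CharT code units and never decodes them, which is how the common standard library implementations behave. The helper `full_match` and the constant `kEAcute` are made-up names for this example only.

```cpp
#include <regex>
#include <string>

// "é" (U+00E9) spelled out as its two UTF-8 bytes, so the example does
// not depend on the encoding of the source file.
const std::string kEAcute = "\xC3\xA9";

// True if `pattern` matches all of `text`. Both are plain char
// sequences; std::regex compares code units and does not decode UTF-8.
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this, `full_match(kEAcute, kEAcute)` should succeed because the literal bytes line up, while `full_match(kEAcute, ".")` should fail and `full_match(kEAcute, "..")` should succeed, since `.` consumes one code unit rather than one character.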

On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing: a proper needle string can never match starting from the middle of a character in the haystack. The same goes for UTF-8. If a tool doesn't understand UTF-16 then it's highly likely that it doesn't know that UTF-8 is variable-length either, and it will truncate UTF-8 in the middle of a character.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

https://en.wikipedia.org/wiki/UTF-8#Description
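The "backing up" step from the quote can be sketched in a few lines. This is only an illustration that assumes the input is valid UTF-8 and that `pos` is a valid index; `utf8_char_start` is a hypothetical helper name, not a standard function.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the lead byte of the UTF-8 character that
// contains the byte at `pos`.
// Assumes `s` is valid UTF-8 and pos < s.size().
std::size_t utf8_char_start(const std::string& s, std::size_t pos) {
    // Continuation bytes have the bit pattern 10xxxxxx, so step back
    // until we reach a byte that is not a continuation byte.
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

For the string "a€b" (bytes 61 E2 82 AC 62), any position inside the three-byte "€" sequence maps back to index 1, while the ASCII bytes map to themselves.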

The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never put characters outside the BMP in a character class, as noted in the comments. Such an engine would treat the two surrogates as two separate class members, so replace the class with a group (an alternation of the full sequences) instead.
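The character-class pitfall is easy to reproduce in a portable way with UTF-8 bytes in std::regex, which has the same code-unit-at-a-time behaviour as a UCS-2-only engine has with surrogate pairs. The sketch below is an analogy under that assumption, not a UTF-16 test; `kChar`, `matches_class` and `matches_group` are names invented for the example.

```cpp
#include <regex>
#include <string>

// "é" (U+00E9) as its two UTF-8 code units. In a UCS-2-only engine a
// non-BMP character (two UTF-16 surrogates) plays the same role.
const std::string kChar = "\xC3\xA9";

// Character class: a code-unit engine reads [é] as "byte C3 OR byte A9".
bool matches_class(const std::string& s) {
    return std::regex_match(s, std::regex("[" + kChar + "]"));
}

// Non-capturing group: both code units must appear, in order.
bool matches_group(const std::string& s) {
    return std::regex_match(s, std::regex("(?:" + kChar + ")"));
}
```

Here the class version should reject the real character (it is two code units long) yet accept a lone `"\xA9"` byte, whereas the group version should accept the character and reject the stray byte.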

In other languages normalization takes place.

This is wrong. Some languages may normalize before matching a regex, but that definitely doesn't apply to all "other languages".

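If you want a little bit more assurance, you could try std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note that the standard only specializes std::regex_traits for char and wchar_t, so these instantiations need a custom traits class and may not compile out of the box. You'll still need a UTF-16-aware library though, otherwise that'll still only work for regex strings that only contain words.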

The better solution may be switching to another library such as ICU regex. You can check Wikipedia's "Comparison of regular expression engines" for some suggestions. It even has a column indicating native UTF-16 support for each library.



 