
Does `std::wregex` support utf-16/unicode or only UCS-2?

Original post 2019-11-27 09:46:23 3 1 c++/ regex/ unicode/ encoding/ widechar

With C++11, the regex library was introduced into the standard library.

On the Windows/MSVC platform, wchar_t has a size of 2 bytes (16 bits), and wchar_t* is normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).

However, it seems that std::regex isn't UTF-8 or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2.

I can't find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.

The question is:

Does std::wregex operate on UCS-2 when wchar_t has a size of 2?

1 answer

The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply sequences of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.

Similarly, std::regex and std::wregex are built around std::basic_string and CharT. Their constructors accept a std::basic_string, and whatever encoding that string uses is also the encoding std::basic_regex works with. So the claim that "std::regex isn't utf-8 or does not support it" is wrong. If the current locale is UTF-8, then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).

On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing: a well-formed needle string can never match starting in the middle of a character in the haystack. The same holds for UTF-8. If a tool doesn't understand UTF-16, then it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate UTF-8 in the middle of a character as well.
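To make the code-unit point concrete, here is a minimal sketch. It uses std::u16string rather than std::wstring so that it behaves the same on every platform, and the helper names (is_high_surrogate, is_low_surrogate, find_utf16) are made up for this illustration. It shows that a character outside the BMP occupies two code units, and that a plain code-unit search still only reports matches at character boundaries when the needle is well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// True if the code unit is a high (leading) surrogate, 0xD800-0xDBFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the code unit is a low (trailing) surrogate, 0xDC00-0xDFFF.
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Plain code-unit search, exactly what a UCS-2-era tool would do.
// Returns the index (in code units) of the first match, or npos.
// Assumes both strings are well-formed UTF-16.
std::size_t find_utf16(const std::u16string& haystack,
                       const std::u16string& needle) {
    return haystack.find(needle);
}
```

For example, in u"a\U0001F600b\U0001F601c" the two emoji take two code units each, so the string has 7 code units, and searching for u"\U0001F601" returns index 4, the position of its high surrogate.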

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

https://en.wikipedia.org/wiki/UTF-8#Description
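The "backing up" property from the quote can be sketched in a few lines. This is only an illustration under the assumption that the input is well-formed UTF-8; is_continuation and char_start are hypothetical helper names, not standard library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Returns the index of the first byte of the character that contains
// byte `pos`. Assumes `s` is well-formed UTF-8 and pos < s.size(), so
// the loop steps back over at most 3 continuation bytes.
std::size_t char_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

A tool that wants to truncate a UTF-8 string at some byte offset can call char_start on that offset first, which avoids cutting a character in half.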

The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments. Replace them with a group instead.
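As a rough sketch of the "group instead of character class" advice: the function name contains_emoji and the two emoji are arbitrary choices for this example. With a 16-bit wchar_t, each emoji in the pattern is a surrogate pair, so a class like [\U0001F600\U0001F601] would be treated as a set of four separate code units, while an alternation keeps each pair together as a two-unit literal sequence.

```cpp
#include <regex>
#include <string>

// Matches either of two non-BMP characters by alternation rather than
// by a character class. Each alternative is a literal sequence of
// wchar_t units: two units where wchar_t is 16 bits (e.g. MSVC),
// one unit where it is 32 bits (e.g. typical Linux toolchains).
bool contains_emoji(const std::wstring& text) {
    static const std::wregex re(L"(?:\U0001F600|\U0001F601)");
    return std::regex_search(text, re);
}
```

The same pattern works whether wchar_t is 16 or 32 bits wide, because the alternation never asks the engine to treat a surrogate pair as a single class member.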

In other languages normalization takes place.

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".

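If you want a little more assurance, you can use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note that the standard library only provides std::regex_traits specializations for char and wchar_t, so these instantiations need a custom traits class. You'll still need a UTF-16-aware library though; otherwise that'll still only work for regex strings that only contain words.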

A better solution may be to switch to another library such as ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library.

Related:

Does std::wstring support UTF-16 and UTF-32 on Windows?

What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?

Why unicode char is stored as UTF-8 in std::string and UTF-16/32 in wchar_t?

C++ unicode UTF-16 encoding

Convert unicode codepoint to utf-16

Encode/Decode std::string to UTF-16

Passing the first character of a string into another string and using std::stoi to get the integer value, to test if it us UTF-8 or Unicode(UTF-16)

clang: converting const char16_t* (UTF-16) to wstring (UCS-4)

Write Unicode UTF-8 and UTF-16 data into a QByteArray


