
Using regular expressions with Unicode strings in C

I'm currently using regular expressions on Unicode strings, but I only need to match ASCII characters, so all non-ASCII characters are effectively ignored. So far the functions in regex.h work fine (I'm on Linux, so the encoding is UTF-8). Can someone confirm that this is really OK? Or do I need a Unicode-aware regex library (like ICU)?

UTF-8 is a variable-length encoding: some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the leading bits of the first byte of each character: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes. Every following (continuation) byte starts with 10.
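The prefix rule above can be sketched in a few lines of C. This is only an illustration: the names utf8_seq_len and utf8_count are made up for this example, and utf8_count assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by `lead`,
 * or 0 if `lead` is a continuation byte (10xxxxxx) or not a valid lead byte.
 * Hypothetical helper, named for this example only. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: plain ASCII   */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte sequence */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte sequence */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte sequence */
    return 0;
}

/* Number of characters (code points) in a NUL-terminated string,
 * ASSUMING the string is well-formed UTF-8: every byte that is not a
 * continuation byte starts a new character. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For instance, utf8_count applied to the four bytes of "aèb" gives 3, while strlen gives 4.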

If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1 byte characters in which case it matches ASCII.

However, since no byte of a multi-byte UTF-8 sequence is a null byte, none of those bytes can be confused with ASCII (they all have the high bit set), and if you really are only matching ASCII, you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)
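To make the "get away with it" case concrete, here is a minimal sketch using regex.h with an ASCII-only pattern on UTF-8 text. The helper name posix_ascii_search is invented for this example, and it assumes the program runs in the default "C" locale (no setlocale call), where the matching is done byte by byte.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regex `pattern` matches somewhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile.
 * Hypothetical helper; assumes the default "C" locale. */
int posix_ascii_search(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

An ASCII pattern such as [0-9]+ still finds the digits in "prix: 42 €", and an ASCII letter is never found inside the bytes of è or €, because those bytes are all 0x80 or above.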

Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE becomes Unicode aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. GLib (the GNOME utility library) also ships a PCRE-based regex API, along with a feast of very handy C functions.

You need to be careful about your patterns and about the text you're going to match.

As an example, given the expression a.b :

"axb" matches 
"aèb" does NOT match

The reason is that è is two bytes long when UTF-8 encoded (0xC3 0xA8), but a byte-oriented . only matches the first of them.

So as long as you only match sequences of ASCII characters, you're safe. If you mix ASCII and non-ASCII characters, you're in trouble.
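The a.b example can be reproduced with a short sketch. It assumes the default "C" locale, where regex.h works byte by byte; an implementation running under a UTF-8 locale selected with setlocale may treat the dot differently. The helper name posix_bytewise_match is invented for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regex `pattern` matches somewhere in
 * `text`, 0 if not, -1 if the pattern does not compile.
 * Hypothetical helper; assumes the default "C" locale (byte-wise matching). */
int posix_bytewise_match(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under that assumption, a.b matches "axb" but not "aèb", whereas a..b (two dots, one per byte of è) does match "aèb", which shows that the dot is counting bytes rather than characters.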

You can try to match a single UTF-8 encoded "character" with something like:

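([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xF7]...|.)

but this assumes that the text is encoded correctly (and, frankly, I never tried it). Note that four-byte sequences can start with any byte of the form 11110xxx, not just \xF0, and that POSIX regcomp does not interpret \xNN escapes itself, so the bytes have to be put into the pattern by the C string literal.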
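Here is one way that idea could be tried with regex.h, as a sketch only. It uses [\xF0-\xF7] as the lead-byte range for four-byte sequences, lets the C compiler turn the \x escapes into raw bytes, and assumes both the default "C" locale and well-formed UTF-8 input. The macro UTF8_CHAR and the function one_char_between_a_and_b are names made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* One UTF-8 encoded character, ASSUMING well-formed input. The \x escapes
 * are expanded by the C compiler, so regcomp sees the raw bytes. */
#define UTF8_CHAR "([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xF7]...|.)"

/* Returns 1 if `text` is exactly 'a', one UTF-8 character, 'b';
 * 0 if not; -1 if the pattern does not compile on this platform.
 * Hypothetical helper; assumes the default "C" locale. */
int one_char_between_a_and_b(const char *text)
{
    assert(text != NULL);

    regex_t re;
    if (regcomp(&re, "^a" UTF8_CHAR "b$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

The intent is that "axb", "aèb" and "a€b" are all accepted while "ab" is rejected. Malformed input (a stray lead or continuation byte) is not detected, since the final . alternative accepts any single byte.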
