Using regular expressions with Unicode strings in C

Question

I'm currently using regular expressions on Unicode strings, but I only need to match ASCII characters, so all non-ASCII characters are effectively ignored. So far the functions in regex.h have worked fine (I'm on Linux, so the encoding is UTF-8). Can someone confirm whether this is really OK? Or do I need a Unicode-aware regex library (like ICU)?

Answer 1

UTF-8 is a variable-length encoding: some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the leading bits of each character's first byte: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes. Every following (continuation) byte starts with 10.
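To make the prefix rule concrete, here is a minimal sketch that classifies a lead byte by its high bits. The function name `utf8_seq_len` is made up for this illustration, and it only looks at the bit prefix; it does not reject lead bytes that are otherwise invalid in UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns the total length in bytes (1-4) of the
 * UTF-8 sequence that starts with `lead`, or 0 if `lead` is a
 * continuation byte (10xxxxxx) or has no valid lead-byte prefix. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* 10xxxxxx or 11111xxx */
}
```

For example, the first byte of è (0xC3) yields 2, while its second byte (0xA8) yields 0 because it is a continuation byte.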

If you try to read a UTF-8 string as ASCII, or as any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1-byte characters, in which case it is identical to ASCII.

However, since no multi-byte UTF-8 sequence contains a null byte, and none of the bytes of a multi-byte sequence can be confused with an ASCII character (they all have the high bit set), if you really are only matching ASCII you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)

Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE becomes Unicode aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. PCRE-based regexes are also available through GLib (the GNOME utility library), which itself provides a feast of very handy C functions.

Answer 2

You need to be careful about your patterns and about the text you're going to match.

As an example, given the expression a.b:

"axb" matches 
"aèb" does NOT match

The reason is that è is two bytes long when UTF-8 encoded, but . (matching a single byte) would only match the first one.

So as long as you only match sequences of ASCII characters, you're safe. If you mix ASCII and non-ASCII characters, you're in trouble.
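A small sketch of the safe case, using only regex.h with an ASCII-only pattern on a UTF-8 string. The helper name `find_ascii_digits` is invented for this example. Note that the offsets it reports are byte offsets, not character positions.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: finds the first run of ASCII digits in `text`.
 * Returns 0 and stores the match's byte offsets in *start and *end
 * (end is exclusive), or returns -1 if there is no match or the
 * pattern fails to compile. */
int find_ascii_digits(const char *text, size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    /* The pattern contains only ASCII, so it can never match part of
     * a multi-byte UTF-8 sequence (all of whose bytes are >= 0x80). */
    if (regcomp(&re, "[0-9]+", REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 0;
}
```

For the UTF-8 string "café 42 €", the digits are found at byte offsets 6 to 8, one more than a character count would suggest, because é occupies two bytes.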

You can try to match a single UTF-8 encoded "character" with something like:

([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xF7]...|.)

but this assumes that the text is encoded correctly (and, frankly, I never tried it).
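If you cannot take correct encoding for granted, one option is to check the text before matching. The following is only a structural sketch under that assumption: the name `utf8_is_well_formed` is made up, and it verifies lead bytes and continuation bytes but does not reject overlong forms, surrogates, or code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first `len` bytes of `s` form
 * structurally well-formed UTF-8 (each lead byte is followed by the
 * right number of 10xxxxxx continuation bytes), otherwise 0. */
int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need; /* number of continuation bytes expected */

        if (c < 0x80)                need = 0; /* 0xxxxxxx */
        else if ((c & 0xE0) == 0xC0) need = 1; /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) need = 2; /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) need = 3; /* 11110xxx */
        else return 0; /* stray continuation byte or invalid lead */

        if (need > len - i - 1)
            return 0; /* sequence is cut off at the end of the buffer */

        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0; /* not a 10xxxxxx continuation byte */

        i += need + 1;
    }
    return 1;
}
```

A string that fails this check should not be fed to a byte-level pattern like the one above, since the pattern would then group bytes that do not belong together.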

2 answers

Answer 1
3 2016-12-12 05:16:00

Answer 2
0 2020-11-15 18:55:49
