
Would this regex be multibyte safe?

I'm using the following regex to check that an image filename contains only alphanumeric characters, underscores, hyphens, and periods:

preg_match('!^[\w.-]*$!',$filename) 

This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent undefined behaviour, or will this regex reliably reject multibyte filenames?

PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, there appears to be no way to create a "multibyte-character string" as such; rather, you choose to treat a native string as a multibyte-character string by using the special "mbstring" functions. In other words, a PHP string does not know its own character encoding, so you have to keep track of it manually.

You may be able to get away with it so long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might encode part of a multibyte character as a "special" PHP byte, such as 0x22, the double-quote symbol.
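The "high bytes" point can be checked directly. The following is only a minimal sketch: the helper name hasOnlyAsciiBytes is hypothetical, and the last check assumes the default "C" locale, since without the u modifier the meaning of \w for bytes above 0x7f can depend on the locale set via setlocale().

```php
<?php
// "ß" (U+00DF) written as explicit UTF-8 bytes, so the example does not
// depend on the encoding of this source file.
$eszett = "\xc3\x9f";

// Hypothetical helper: true when every byte is below 0x80 (plain ASCII).
function hasOnlyAsciiBytes(string $s): bool
{
    return preg_match('/^[\x00-\x7f]*$/D', $s) === 1;
}

$name = "stra" . $eszett . "e.jpg"; // "straße.jpg"

// Both bytes of the character are "high" (>= 0x80).
var_dump(bin2hex($eszett));

// A name containing a multibyte character is therefore never pure ASCII.
var_dump(hasOnlyAsciiBytes($name));

// The pattern from the question, applied byte by byte (no u modifier).
// Assumption: default "C" locale, where high bytes are not word characters.
$questionPatternResult = preg_match('!^[\w.-]*$!', $name);
var_dump($questionPatternResult);
```

Under that locale assumption, the pattern from the question should simply fail to match such a name rather than misbehave.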

The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.

PCRE-based regular expression functions like preg_match support UTF-8 by using the u modifier (PCRE_UTF8).
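One consequence of the u modifier is worth seeing in code. This is a sketch, not a definitive recipe: the helper name isAsciiFilename is hypothetical, and it assumes a PHP build whose PCRE has Unicode support (the usual case), where \w under the u modifier also matches non-ASCII letters.

```php
<?php
// Hypothetical helper: strict ASCII whitelist. The D modifier stops "$" from
// matching before a trailing newline, and "+" rejects the empty string.
function isAsciiFilename(string $filename): bool
{
    return preg_match('/^[A-Za-z0-9_.-]+$/D', $filename) === 1;
}

$utf8Name = "stra\xc3\x9fe.jpg"; // "straße.jpg" as UTF-8 bytes

// With the u modifier the subject is read as UTF-8 characters, and \w is
// expected to accept letters such as "ß" as well.
$withU = preg_match('!^[\w.-]*$!u', $utf8Name);

// A subject that is not valid UTF-8 makes a u-modified match return false.
$invalid = preg_match('!^[\w.-]*$!u', "\xff.jpg");
$badUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;

var_dump($withU, isAsciiFilename($utf8Name), $invalid, $badUtf8);
```

So adding the u modifier to the pattern from the question would tend to widen what it accepts; if the goal is ASCII-only filenames, spelling out the character class as in the helper avoids relying on what \w means.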

But of course, as described above, PHP strings don't know their own encoding, so before using the mb_ereg functions you first need to tell the "mbstring" library which encoding to assume by calling the mb_regex_encoding function. Note that this function specifies the encoding of the string you're matching, not of the string containing the regular expression itself.
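As a rough illustration of that workflow, here is a sketch using the mb_ereg family. The helper name mbMatches is hypothetical, the default of UTF-8 is an assumption about the input, and the code assumes mbstring was built with its regex support. The \A and \z anchors are used so that the whole string must match.

```php
<?php
// Hypothetical helper: match $subject against $pattern with mb_ereg,
// after telling mbstring which encoding the subject string is in.
function mbMatches(string $pattern, string $subject, string $encoding = 'UTF-8'): bool
{
    if (!function_exists('mb_ereg')) {
        throw new RuntimeException('mbstring regex functions are not available');
    }
    $previous = mb_regex_encoding();   // remember the current setting
    mb_regex_encoding($encoding);      // encoding of the string being matched
    // mb_ereg takes the pattern without delimiters; false means "no match".
    $matched = mb_ereg($pattern, $subject) !== false;
    mb_regex_encoding($previous);      // restore the setting for other callers
    return $matched;
}

if (function_exists('mb_ereg')) {
    // ASCII-only whitelist: rejects the UTF-8 name, accepts the plain one.
    var_dump(mbMatches('\A[A-Za-z0-9_.-]+\z', "stra\xc3\x9fe.jpg"));
    var_dump(mbMatches('\A[A-Za-z0-9_.-]+\z', 'photo_01.jpg'));
    // Under UTF-8, "." consumes the two-byte "ß" as a single character.
    var_dump(mbMatches('\A.\z', "\xc3\x9f"));
}
```

Restoring the previous encoding is a design choice for the sketch: mb_regex_encoding changes a global setting, so leaving it altered could affect unrelated code.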


 