What is a safe PCRE regex delimiter to use with the HTML5 input element's pattern attribute?

It seems that the HTML5 spec (and therefore ECMA262) allows <input type="text" pattern="[0-9]/[0-9]" /> to match the string '0/0' even though the forward slash is not escaped. Web applications like Drupal would like to provide server-side validation for browsers that don't support HTML5, with something like:

<?php
preg_match('/^(' . $pattern . ')$/', $value);
?>

Unfortunately, the string '[0-9]/[0-9]' is not a valid PCRE regex once it is wrapped in / delimiters. It appears that most, if not all, HTML5-capable browsers support both pattern="[0-9]/[0-9]" and pattern="[0-9]\/[0-9]", which raises the question: what delimiter can we use to run this pattern through a Perl-style regex engine?

We've filed a bug report against the W3C spec, but are the browsers wrong here? Does the HTML5 spec need to be clarified? Is there a workaround we can use in PHP?

I recommend using the "\xFF" byte as the pattern delimiter. It never occurs in a valid UTF-8 string, so we can be sure it will not occur in the pattern. And because preg_match does not interpret the delimiter as UTF-8, it causes no trouble.

Example: preg_match("\xFF$pattern\$\xFFADmsu", $subject);

Note the ADmsu modifiers and the added $. The u modifier requires valid UTF-8 bytes only in the pattern, not in the delimiters around it.

It is a valid regex if you use # instead of / for the delimiter. Example:

preg_match('#^('.$pattern.')$#', $value);

One of the problems with PCRE is that almost any delimiter is legal for the start and end markers, depending on what makes the rest of the escaping easier. So #foo# is legal, /foo/ is legal, !foo! is legal (I think), etc. Undelimited regexes, I'd say, are extremely dangerous for exactly that reason. It sounds like a bug in the HTML5 spec that it doesn't specify a delimiter.

Maybe in PHP, scan the string and pick a delimiter from a whitelist that is not present in the string? (E.g., if there's no / use that; if there is, use #; if that's there too, use %; etc.)
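A minimal sketch of that idea, assuming a small hand-picked candidate list. The function names pick_delimiter() and html5_pattern_matches() and the candidate list itself are hypothetical, not part of Drupal or PHP. The pattern is wrapped in ^(?:...)$ so that the whole value has to match.

```php
<?php
// Hypothetical helper: return the first candidate delimiter that does not
// occur anywhere in $pattern, or null if every candidate is already used.
function pick_delimiter(string $pattern, array $candidates = ['/', '#', '%', '~', '!', '@']): ?string
{
    foreach ($candidates as $delimiter) {
        if (strpos($pattern, $delimiter) === false) {
            return $delimiter;
        }
    }
    return null;
}

// Hypothetical helper: validate $value against an HTML5-style pattern.
// Returns null when no safe delimiter could be found.
function html5_pattern_matches(string $pattern, string $value): ?bool
{
    $d = pick_delimiter($pattern);
    if ($d === null) {
        return null;
    }
    // Anchor the pattern so the entire value must match.
    return preg_match($d . '^(?:' . $pattern . ')$' . $d, $value) === 1;
}
```

A pattern that happens to contain every candidate leaves no delimiter to pick, so the caller still needs a fallback for the null case.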

I think chr(0) would work just fine. Edit: no. But chr(1) does work.
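For illustration, a sketch of how chr(1) could be used as the delimiter. The function name matches_html5_pattern() is hypothetical, and the D and u modifiers are an assumption about what a validator would want (no match before a trailing newline, UTF-8 handling); the answer above only says that chr(1) works.

```php
<?php
// Hypothetical helper: match $value against $pattern using the control
// character 0x01 as the PCRE delimiter.
function matches_html5_pattern(string $pattern, string $value): bool
{
    $delimiter = chr(1);
    // Guard against the unlikely case that the pattern contains 0x01 itself.
    if (strpos($pattern, $delimiter) !== false) {
        throw new InvalidArgumentException('Pattern contains the delimiter byte 0x01.');
    }
    // D: "$" matches only at the very end of the subject; u: treat as UTF-8.
    $regex = $delimiter . '^(?:' . $pattern . ')$' . $delimiter . 'Du';
    return preg_match($regex, $value) === 1;
}
```

Because the delimiter is a character that practically never appears in a hand-written pattern, characters such as / and # need no escaping.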

Just enclose it in brackets or parentheses (yes, that's strange!):

<?php
// The inner (?: ... ) keeps alternations such as foo|bar anchored on both sides.
preg_match('(^(?:' . $pattern . ')$)', $value);
?>

The manual states that you can use all corresponding pairs: http://php.net/manual/en/regexp.reference.delimiters.php

It is not obvious at first, but this handles characters such as / and # in between without any escaping, as long as the parentheses inside the pattern are balanced. For example, '(^(foo|bar)$)' works and yields the final regular expression ^(foo|bar)$, without any potentially risky escapes.
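A small sketch of the parenthesis-delimiter approach, including one caveat: PHP finds the closing delimiter by counting unescaped ( and ), so a lone parenthesis inside a character class, as in [)], throws the count off and the regex is rejected. The function name matches_with_paren_delimiters() is hypothetical.

```php
<?php
// Hypothetical helper: use "(" and ")" as the PCRE delimiters.
// Returns true/false for match/no match, or null if PHP rejects the regex.
function matches_with_paren_delimiters(string $pattern, string $value): ?bool
{
    // @ suppresses the warning preg_match() emits for a malformed regex.
    $result = @preg_match('(^(?:' . $pattern . ')$)', $value);
    return $result === false ? null : $result === 1;
}
```

Patterns with balanced parentheses, such as [0-9]/[0-9] or foo|bar, go through unchanged; a pattern like [)] would need a different delimiter.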

Given that a PHP application (Drupal in this case) is generating the input field, it seems like a workaround would be to do something along the lines of:

$pattern = '[0-9]/[0-9]';
...
$cleanPattern = preg_replace('/\//', '\\/', $pattern);
preg_match('/' . $cleanPattern . '/', $subject, $matches);

I couldn't think of a case where this wouldn't work, with / being used as a literal in the expression.
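One case worth checking is a pattern that already contains an escaped slash (\/): a plain replace turns it into \\/, which PCRE reads as an escaped backslash followed by the closing delimiter. A hedged sketch that skips over existing escape sequences is shown here; escape_slash_delimiter() is a hypothetical name, and the ^(?:...)$ wrapping in the usage line is an assumption about the intended whole-value match.

```php
<?php
// Hypothetical helper: escape only bare "/" characters in $pattern,
// leaving existing backslash escapes (including "\/") untouched.
function escape_slash_delimiter(string $pattern): string
{
    return preg_replace_callback(
        // Match either a backslash plus the character it escapes, or a bare slash.
        '~\\\\.|/~s',
        static function (array $m): string {
            return $m[0] === '/' ? '\\/' : $m[0];
        },
        $pattern
    );
}

// Usage: both spellings of the pattern validate the same value.
$ok = preg_match('/^(?:' . escape_slash_delimiter('[0-9]/[0-9]') . ')$/', '0/0');
```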

The HTML5 spec defers to ECMA262 for the legal pattern specification:

If specified, the attribute's value must match the JavaScript Pattern production. [ECMA262]

Since ECMA262 defines a BNF grammar for patterns, a full parser (instead of PCRE) seems like the safest approach.
