PHP PCRE regular expression

In LaTeX, the expression \o{a}{b} means that the operator 'o' takes two arguments, a and b. LaTeX also accepts \o{a}, in which case it treats the second argument as the empty string.

Now I try to match the regex \\\\o\{([\s\S]*?)\}\{([\s\S]*?)\} against the string \o{a}\o{a}{b}. PHP treats the whole string as a match when it isn't one. (The correct interpretation of this string is that the substring \o{a}{b} is the only match.) The point is that I need to know how to tell PHP that if anything other than { follows the first }, then it is not a match.

How should I do that?

Edit: Arguments of an operator are allowed to contain the symbols \, { and }. But in this case the whole string is not a match because the curly brackets in a}\o{a do not conform to LaTeX rules (e.g. { must come before }), so a}\o{a cannot be an argument of an operator.

Edit 2: On the other hand, \o{{a}}{b} should be a match, as {a} is a valid argument.

I suggest something like this:

$s = '\\o{a}\\o{a}{b}';
echo "$s\n";  # Check string
preg_match('~\\\o(\{(?>[^{}\\\]++|(?1)|\\\.)+\}){2}~', $s, $match);
print_r($match);

The regex:

  • uses recursion to deal with nested braces,
  • also handles backslash escapes ( [^{}\\\] and \\\. ) so that escaped literal braces are not taken for syntactic braces.

\\\o             # Matches \o
(                # Group 1, the target of the (?1) recursion
  \{             # Matches {
  (?>            # Begin atomic group (no backtracking into it, which makes failing faster)
     [^{}\\\]++  # Any characters except braces and backslash
  |
     (?1)        # Or recurse the outer group
  |
     \\\.        # Or match an escaped character
  )+             # As many times as necessary
  \}             # Closing brace
){2}             # Repeat twice
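
As a quick check of the breakdown above, here is a minimal sketch that runs the same pattern against the strings discussed in the question. The helper name firstTwoArgOperator and the escaped-brace test string are assumptions made for illustration; they are not part of the original post.

```php
<?php
// Pattern copied from the answer; backslashes are written for a single-quoted PHP string.
$pattern = '~\\\o(\{(?>[^{}\\\]++|(?1)|\\\.)+\}){2}~';

// Hypothetical helper (not from the original post):
// returns the first two-argument operator found, or null if there is none.
function firstTwoArgOperator($pattern, $subject)
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$subjects = [
    '\\o{a}\\o{a}{b}',  // only the trailing \o{a}{b} should match
    '\\o{{a}}{b}',      // whole string should match (nested braces)
    '\\o{\\{}{b}',      // whole string should match (escaped brace)
    '\\o{a}',           // should not match (one argument only)
];

foreach ($subjects as $subject) {
    $result = firstTwoArgOperator($pattern, $subject);
    echo $subject, ' => ', $result === null ? '(no match)' : $result, "\n";
}
```

Note that the inner group is quantified with +, so an empty argument such as \o{}{b} is not matched by this pattern; changing + to * would allow it, if that is what you need.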

The problem with your current regex is that once the part \\\\o\{([\s\S]*?) has matched, the engine simply looks for the next \} at which the rest of the pattern can succeed, and there it does not matter whether you are using a lazy quantifier or a greedy one. You need to somehow prevent the argument from swallowing a } before the actual \} in the regex is reached.

That's why you have to use [^{}], and since you can actually have nested braces inside, this is an ideal situation for recursion.
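
To make the point about lazy versus greedy quantifiers concrete, the following sketch runs both variants of the question's pattern on the question's string. The variable names are chosen for illustration only.

```php
<?php
$s = '\\o{a}\\o{a}{b}';   // the string \o{a}\o{a}{b}

// The pattern from the question (lazy) and the same pattern with greedy quantifiers.
$lazy   = '~\\\\o\{([\s\S]*?)\}\{([\s\S]*?)\}~';
$greedy = '~\\\\o\{([\s\S]*)\}\{([\s\S]*)\}~';

preg_match($lazy, $s, $lazyMatch);
preg_match($greedy, $s, $greedyMatch);

echo $lazyMatch[0], "\n";    // \o{a}\o{a}{b}  (the whole string)
echo $lazyMatch[1], "\n";    // a}\o{a         (the "argument" that should not be allowed)
echo $greedyMatch[0], "\n";  // \o{a}\o{a}{b}  (same result with greedy quantifiers)
```

In both cases the first capture group ends up holding a}\o{a, because [\s\S] is free to consume braces until the }{ sequence is found.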

To deal with possible nested curly brackets, you need to use the recursion feature:

$pattern = <<<'EOD'
~
\\o({(?>[^{}]+|(?-1))*}){2}
~x
EOD;

where (?-1) is a relative reference to the most recently opened capturing group, which here is the group that contains it.
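
One reason to prefer the relative form can be sketched as follows: if another capturing group is added in front, the brace group gets a new number, but (?-1) still points at it. The second pattern and the test string below are assumptions made for illustration.

```php
<?php
// The pattern from this answer, unchanged.
$original = <<<'EOD'
~
\\o({(?>[^{}]+|(?-1))*}){2}
~x
EOD;

// Variant with an extra capturing group around the operator name.
// The brace group is now group 2, yet (?-1) still refers to it.
$withNameGroup = <<<'EOD'
~
\\(o)({(?>[^{}]+|(?-1))*}){2}
~x
EOD;

$subject = '\\o{a}\\o{{a}}{b} and \\o{x}{y}';

preg_match_all($original, $subject, $a);
preg_match_all($withNameGroup, $subject, $b);

print_r($a[0]);  // the full matches: \o{{a}}{b} and \o{x}{y}
print_r($b[0]);  // same full matches
```

Unlike the pattern in the first answer, this one does not treat backslash-escaped braces specially.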

I would guess you need to look into using the anchors ^ and $:

$pattern = '/^\\\\o\{.*\}(\{.*\})?$/';  # \\\\ in a single-quoted PHP string matches one literal backslash

I don't know what you consider acceptable values for a and b, so you can replace .* with an appropriate character class here.

This allows either the \o{a} or the \o{a}{b} format. To match only \o{a}{b}, modify it to this:

$pattern = '/^\\\\o\{.*\}\{.*\}$/';

Based on your last edit, I would suggest replacing .* in the patterns above with [^{]*, as noted in other answers.
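
A minimal sketch of the anchored approach, assuming the arguments contain no braces at all. It uses [^{}]*, a slightly stricter class than [^{]*; that choice is an assumption, not something stated in the answer.

```php
<?php
// Assumption: arguments contain no braces at all, hence [^{}]*.
$pattern = '/^\\\\o\{[^{}]*\}\{[^{}]*\}$/';

var_dump(preg_match($pattern, '\\o{a}{b}'));        // int(1)
var_dump(preg_match($pattern, '\\o{a}\\o{a}{b}'));  // int(0)
var_dump(preg_match($pattern, '\\o{{a}}{b}'));      // int(0)
```

Because of the anchors, this only accepts a string that consists of exactly one operator, and it rejects the nested case \o{{a}}{b} that the question wants to allow, so it fits only if nesting is not needed.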
