
Matching nested patterns with regex (using PHP's recursion)

I am currently trying to write a regular expression in PHP that matches a specific pattern which can contain itself, nested to arbitrary depth. I know that regular expressions cannot do that by default, but PHP's recursive patterns ( http://php.net/manual/de/regexp.reference.recursive.php ) should make it possible.

I have nested structures like this:

<a=5>
    <a=3>
        Foo
        <b>Bar</b>
    </a>
    Baz
</a>

Now I want to match the content of the outermost tag. To correctly pair the first opening tag with the last closing tag, I need PHP's recursion item (?R).

I tried a pattern like this:

/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s

This basically means <a=5>, followed by as many as possible of the following, followed by </a>:

  • another tag (recursively)
  • any character that does not open a tag
  • a tag opener ("<"), followed by an optional slash, not followed by an "a"
  • the same as before, but WITH an "a" that does not end the name (followed by at least one more character)

The last two cases could be merged into one [tag not named "a"], but I heard this should be avoided in regular expressions, because it needs lookarounds and would perform badly.

I see no mistake in my regex, but it does not match the given string. I want the following match:

    <a=3>
        Foo
        <b>Bar</b>
    </a>
    Baz

Here's a link to play around with the regex: https://www.regex101.com/r/lO1wA6/1

You can use this regex to match what you want (the regex is placed in a string literal for the sake of convenience):
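The failure described above can be reproduced with a short script. This is only a sketch using the sample input and the pattern exactly as posted; the variable names are illustrative. One likely cause is that (?R) re-enters the whole pattern, which starts with the literal <a=5>, so the nested <a=3> cannot be consumed by the recursion, and none of the other alternatives accepts "<a=" either.

```php
<?php
// The nested sample from the question.
$subject = "<a=5>\n    <a=3>\n        Foo\n        <b>Bar</b>\n    </a>\n    Baz\n</a>";

// The pattern from the question, unchanged.
$pattern = '/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s';

// (?R) recurses into the entire pattern, including the literal "<a=5>",
// so the repetition stops in front of "<a=3>" and "</a>" cannot follow there.
$result = preg_match($pattern, $subject);
var_dump($result); // int(0)
```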

'~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~'

Here is a breakdown of the regex above:

<a=5>
(
  <([a-zA-Z0-9]+)[^>]*>
  (?1)*
  </\2>
  |
  [^<>]++
)*
</a>

The first part, <([a-zA-Z0-9]+)[^>]*>(?1)*</\2>, matches a pair of matching tags and all of their content. It assumes that the tag name consists of the characters [a-zA-Z0-9]. The tag name is captured by ([a-zA-Z0-9]+) and backreferenced when matching the closing tag </\2>.

The second part, [^<>]++, matches everything else outside the tags. Note that quoted strings are not handled, so depending on your input it may not work.

Then back to the subroutine call (?1), which recursively calls the first capturing group. Notice that a tag can contain zero or more instances of other tags or non-tag content. Because of the way the regex is written, the outermost <a=5>...</a> pair shares this property.

Demo on regex101
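As a rough usage sketch, the regex can be run with preg_match against the sample from the question. The regex itself is taken unchanged from above; stripping the outer tags with substr is an assumption of this example (the answer does not say how to extract the content), and the variable names are illustrative.

```php
<?php
// Sample input from the question.
$subject = "<a=5>\n    <a=3>\n        Foo\n        <b>Bar</b>\n    </a>\n    Baz\n</a>";

// The regex from the answer, unchanged. (?1) re-enters capturing group 1.
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

$content = null;
if (preg_match($pattern, $subject, $m)) {
    // $m[0] is the whole outer element; cut off the literal outer tags
    // to keep only what lies between them.
    $content = substr($m[0], strlen('<a=5>'), -strlen('</a>'));
    echo $content;
} else {
    echo "no match\n";
}
```

Working from the full match avoids adding another capturing group, which would shift the group numbers that (?1) and \2 rely on.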
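The caveat about quoted strings can be seen with a small experiment. The two inputs below are hypothetical and not from the question; they differ only in a ">" inside an attribute value, which [^>]* and [^<>]++ both treat as the end of a tag.

```php
<?php
// The regex from the answer, unchanged.
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical input: the attribute value contains ">".
$quoted = '<a=5><b title="x>y">Bar</b></a>';
// The same input without the ">" inside the quotes.
$plain = '<a=5><b title="xy">Bar</b></a>';

$quotedResult = preg_match($pattern, $quoted);
$plainResult = preg_match($pattern, $plain);

var_dump($quotedResult); // int(0): the ">" inside the quotes ends the tag too early
var_dump($plainResult);  // int(1)
```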

Try this:

PHP

$re = "/(<[^\\/>]+(\\/?)>)*([^<]+)(<\\/\\w+>)*/m";
$str = "<a=5>\n <a=3>\n Foo\n <b/>Bar</b>\n </a>\n Baz\n</a>";

preg_match_all($re, $str, $matches);
var_dump($matches);
// $matches[1]: opening tags
// $matches[2]: the "/" of a self-closing tag such as <b/>
// $matches[3]: inner data
// $matches[4]: closing tags

Output

array (size=5)
  0 => 
    array (size=5)
      0 => string '<a=5>
 ' (length=7)
      1 => string '<a=3>
 Foo
 ' (length=12)
      2 => string '<b/>Bar</b>' (length=11)
      3 => string '
 </a>' (length=6)
      4 => string '
 Baz
</a>' (length=10)
  1 => 
    array (size=5)
      0 => string '<a=5>' (length=5)
      1 => string '<a=3>' (length=5)
      2 => string '<b/>' (length=4)
      3 => string '' (length=0)
      4 => string '' (length=0)
  2 => 
    array (size=5)
      0 => string '' (length=0)
      1 => string '' (length=0)
      2 => string '/' (length=1)
      3 => string '' (length=0)
      4 => string '' (length=0)
  3 => 
    array (size=5)
      0 => string '
 ' (length=2)
      1 => string '
 Foo
 ' (length=7)
      2 => string 'Bar' (length=3)
      3 => string '
 ' (length=2)
      4 => string '
 Baz
' (length=6)
  4 => 
    array (size=5)
      0 => string '' (length=0)
      1 => string '' (length=0)
      2 => string '</b>' (length=4)
      3 => string '</a>' (length=4)
      4 => string '</a>' (length=4)

Live demo

Or:

$re = "/(<[^\\/>]+\\/?>)*([^<]+)(<\\/\\w+>)*/m";
$str = "<a=5>fff\n <a=3>\n Foo\n <b/>Bar</b>\n </a>\n Baz\n</a>";

preg_match_all($re, $str, $matches);
//var_dump($matches);
$md = "";
$c = count($matches[1]);
// Rebuild the string from the captured groups, leaving out the first opening tag.
foreach ($matches[1] as $k => $v) {
    if ($k != 0) {
        $md .= $v . $matches[2][$k] . $matches[3][$k];
    } else if ($c != $k + 1) {
        $md .= $matches[2][$k] . $matches[3][$k];
    }
}
var_dump($md);

Live demo

Output

 string 'fff
 <a=3>
 Foo
 <b/>Bar</b>
 </a>
 Baz
</a>' (length=44)
