
preg match text between tags excluding same tag in between

I know there are several similar questions, but I could not find any covering this specific case.

I took some code and tweaked it to my needs, but now I have found a bug in it that I can't fix.

Code:

$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match);

class Tags {
    // http://stackoverflow.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
    static function get($xml, $tag) {
        // bug case: string(56) "<namespaces>
        //      <namespace key="-2">Media</namespace>"
        $tag_ini = "<{$tag}[^\>]*?>";
        $tag_end = "<\\/{$tag}>";
        $tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';

        preg_match_all($tag_regex, $xml, $matches, PREG_OFFSET_CAPTURE);
        return $matches;
    }
}

As you can see, there is a bug when the tag is nested. For this input the match starts at the outer <namespaces> tag:

<namespaces> <namespace key="-2">Media</namespace>

It should return 'Media', or even the outer '<namespaces>' first and then the inner ones.
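
The bug can be reproduced in isolation with a short script. This is only a sketch: it inlines the pattern from the function above, drops PREG_OFFSET_CAPTURE for readability, and uses the bug-case string from the code comment as input.

```php
<?php
// Reproduction of the bug with the pattern from the question.
$xml = "<namespaces>\n      <namespace key=\"-2\">Media</namespace>";
$tag = 'namespace';

// Same pattern as in the question: nothing stops "<namespace" from
// also matching the beginning of "<namespaces>".
$buggy = '/<' . $tag . '[^\>]*?>(.*?)<\/' . $tag . '>/si';

preg_match_all($buggy, $xml, $matches);
$captured = $matches[1][0];

// The capture starts right after "<namespaces>" instead of being "Media".
var_dump($captured);
```

The captured group contains the newline, the indentation and the whole inner opening tag, because the outer <namespaces> was accepted as the opening tag.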

I tried changing the opening part to "<{$tag}[^\>|^\r\n ]*?>", adding ^\s+, changing the * to *?, and a few other things, which at best ended up matching only the buggy case.

I also tried "<{$tag}[^{$tag}]*?>", which returns nothing; I suppose it cancels itself out.
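
A likely explanation for the empty result: inside square brackets the tag name is read as a set of single characters, not as a word. The sketch below, which uses a one-line sample input, shows that the pattern fails as soon as an attribute contains one of those letters (here the "e" in "key").

```php
<?php
$tag = 'namespace';
$xml = '<namespace key="-2">Media</namespace>';

// "[^namespace]" is a character class: "any one character that is not
// n, a, m, e, s, p or c". It does not mean "not the word namespace".
$pattern = "/<{$tag}[^{$tag}]*?>(.*?)<\\/{$tag}>/si";

$count = preg_match_all($pattern, $xml, $matches);

// No match: the class cannot consume the "e" of key, so ">" is never reached.
var_dump($count);
```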

I'm a newbie at regex. As far as I can tell, the fix just needs a rule saying that a new tag of the same type must not open inside the match. I could even use a hack for my use case that excludes matches whose inner text contains a newline.

Can anyone come up with the right syntax for this?

You can check an extract of the text here: http://pastebin.com/f2naN2S3


After the proposed change, $tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>";, it does work for the example case, but not for this one:

<namespace key="0" />
      <namespace key="1">Talk</namespace>

As it results in:因为它导致:

<namespace key="1">Talk"

I think this is because digits, quotes and letters are all treated as being inside the word boundary. How could I address that?

This is probably not the ideal answer, but I was messing around with a regex generator:

<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11

$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';

$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))';   # Word 1

if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
    $word1=$matches[1][0];
    print "($word1) \n";
}

#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>

The main problem is that you did not use a word boundary after the tag name in the opening tag, so namespace in the pattern could also match the namespaces tag, and many others.
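
As a quick check of the word-boundary point, here is a minimal sketch that runs the bug-case input from the question through a pattern with \b after the tag name. The pattern is built inline rather than through the asker's function.

```php
<?php
$tag = 'namespace';
$xml = "<namespaces>\n      <namespace key=\"-2\">Media</namespace>";

// \b after the tag name requires the name to end there, so
// "<namespaces" no longer qualifies as an opening "<namespace" tag.
$pattern = "/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}>/si";

preg_match_all($pattern, $xml, $matches);
$captured = $matches[1];

// Only the inner tag's text is captured.
var_dump($captured);
```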

The subsequent issue is that the <${tag}\b[^>]*>(.*?)<\/${tag}> pattern would overmatch if a self-closing namespace tag is followed by a "normal" paired open/close namespace tag: the self-closing tag is taken as the opening tag. So, you need to either use a negative lookbehind (?<!\/) before the >, or use a negative lookahead (?![^>]*\/>) after \b.
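
The three variants can be compared side by side on the self-closing example from the question. This is only a sketch; the array keys plain, lookbehind and lookahead are labels chosen for this illustration.

```php
<?php
$tag = 'namespace';
$xml = "<namespace key=\"0\" />\n      <namespace key=\"1\">Talk</namespace>";

$patterns = [
    // Word boundary only: the self-closing tag is accepted as an opener.
    'plain'      => "/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}>/si",
    // The closing ">" must not be preceded by "/".
    'lookbehind' => "/<{$tag}\\b[^>]*(?<!\\/)>(.*?)<\\/{$tag}>/si",
    // No "/>" may appear before the tag's closing ">".
    'lookahead'  => "/<{$tag}\\b(?![^>]*\\/>)[^>]*>(.*?)<\\/{$tag}>/si",
];

$results = [];
foreach ($patterns as $name => $re) {
    preg_match_all($re, $xml, $m);
    $results[$name] = $m[1];
}

var_dump($results);
```

With the plain pattern the capture begins after the self-closing tag and includes the second opening tag; both lookaround variants capture just Talk.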

So, you can use

$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
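
Put together, the function from the question might look like the sketch below. The name get_tag_contents is hypothetical (a standalone stand-in for Tags::get), the preg_quote() call is an added precaution that the original code does not have, and the sample input is assembled from the snippets in the question.

```php
<?php
// Hypothetical standalone version of Tags::get() from the question.
function get_tag_contents(string $xml, string $tag): array
{
    // preg_quote() is an added precaution so that a tag name containing
    // regex metacharacters cannot alter the pattern.
    $name = preg_quote($tag, '/');
    $tag_ini = "<{$name}\\b[^>]*(?<!\\/)>";
    $tag_end = "<\\/{$name}>";
    $tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';

    preg_match_all($tag_regex, $xml, $matches, PREG_OFFSET_CAPTURE);
    return $matches;
}

$xml = "<namespaces>\n"
     . "      <namespace key=\"-2\">Media</namespace>\n"
     . "      <namespace key=\"0\" />\n"
     . "      <namespace key=\"1\">Talk</namespace>\n"
     . "</namespaces>";

$matches = get_tag_contents($xml, 'namespace');

// With PREG_OFFSET_CAPTURE each entry is [text, byte offset],
// so column 0 holds the captured text.
$texts = array_column($matches[1], 0);
var_dump($texts);
```

The outer <namespaces> tag and the self-closing tag are both skipped, leaving the text of the two paired tags.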

This line is what I needed:

   $tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";

Thank you very much, @Alison and @Wictor, for your help and directions.
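
One caveat about the character-class variant above, offered as a sketch rather than a correction: [^>|^\/>] is a single negated class that excludes the characters >, |, ^ and /, so it rejects any opening tag with a "/" anywhere before its closing ">", not only self-closing ones. The path attribute in the input below is invented to show the difference from the lookbehind version.

```php
<?php
$tag = 'namespace';
// Hypothetical input: an attribute value that happens to contain "/".
$xml = '<namespace key="1" path="a/b">Talk</namespace>';

$charClass  = "/<{$tag}\\b[^>|^\\/>]*>(.*?)<\\/{$tag}>/si";
$lookbehind = "/<{$tag}\\b[^>]*(?<!\\/)>(.*?)<\\/{$tag}>/si";

$classCount  = preg_match_all($charClass, $xml, $a);
$behindCount = preg_match_all($lookbehind, $xml, $b);

// The character class finds nothing; the lookbehind still captures "Talk".
var_dump($classCount, $b[1]);
```

For data like the sample in the question, where attributes hold only numeric keys, both versions behave the same.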
