preg match text between tags excluding same tag in between

I know there are several similar questions, but I could not find any that covers this specific case.

I took some code and tweaked it to my needs, but I have now found a bug in it that I can't fix.

Code:

$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match); 

  static function get($xml, $tag) { // http://stackoverflow.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
      // bug case: string(56) "<namespaces>
      //       <namespace key="-2">Media</namespace>"
      $tag_ini = "<{$tag}[^\>]*?>";
      $tag_end = "<\\/{$tag}>";
      $tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';

      preg_match_all($tag_regex, $xml, $matches, PREG_OFFSET_CAPTURE);
      return $matches;
  }

As you can see, there is a bug when the tag sits inside a similarly named tag. The match comes back as:

<namespaces> <namespace key="-2">Media</namespace>

It should return 'Media', or even the outer '<namespaces>' followed by the inner ones.
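The behaviour described above can be reproduced in isolation. This is only a minimal sketch: the input string is the bug case quoted in the question, the variable names are made up for illustration, and the pattern is assembled the same way as in get() but without the helper class.

```php
<?php
// Sample input copied from the question's bug case.
$xml = "<namespaces>\n      <namespace key=\"-2\">Media</namespace>";
$tag = 'namespace';

// Same shape as the question's pattern: nothing forces the tag name to end after "namespace".
$tag_regex = '/<' . $tag . '[^>]*?>(.*?)<\/' . $tag . '>/si';
preg_match($tag_regex, $xml, $m);

// [^>]*? absorbs the trailing "s", so the match begins at "<namespaces>"
// and the captured group carries the inner opening tag along with "Media".
$startsAtOuterTag = (strpos($m[0], '<namespaces>') === 0);
var_dump($startsAtOuterTag);
var_dump($m[1]);
```

The point of the sketch is that the opening-tag part of the pattern accepts any tag whose name merely begins with the requested name.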

I tried adding "<{$tag}[^\>|^\r\n ]*?>", ^\s+, changing the * to *?, and a few other things, which at best matched only the buggy case.

I also tried "<{$tag}[^{$tag}]*?>", which returns nothing; I suppose it cancels itself out.

I'm new to regex. As far as I can tell, the fix just needs a rule that prevents a new tag of the same type from opening inside the match. I could even live with a hack for my use case that rejects a match when the inner text contains a line break.

Can anyone get the right syntax for this?


You can check an extract of the text here: http://pastebin.com/f2naN2S3


After the proposed change, $tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>";, it works for the example case, but not for this one:

<namespace key="0" />
      <namespace key="1">Talk</namespace>

As it results in:

<namespace key="1">Talk"

I think it's because digits, quotes, and letters are treated as part of the word boundary. How could I address that?

This is probably not the ideal answer, but I was experimenting with a regex generator:

<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11

$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';

$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))';   # Word 1

if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
    $word1=$matches[1][0];
    print "($word1) \n";
}

#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>

The main problem is that you did not use a word boundary after the tag name in the opening tag, so namespace in the pattern could also match a namespaces tag, and many others.

The subsequent issue is that the <${tag}\b[^>]*>(.*?)<\/${tag}> pattern would overmatch if a self-closing namespace tag is followed by a "normal" paired open/close namespace tag. So you need to either use a negative lookbehind (?<!\/) before the >, or use a (?![^>]*\/>) negative lookahead after \b.

So, you can use

$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
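As a rough check of the line above, here is a minimal sketch that runs it against both cases from the question. The function name get_tag_contents, the combined sample document, and the preg_quote() call are assumptions added for illustration; the two pattern fragments are the ones given in this answer.

```php
<?php
// Hypothetical helper; mirrors get() from the question but returns only the captured contents.
function get_tag_contents(string $xml, string $tag): array
{
    $name    = preg_quote($tag, '/');          // assumption: escape the tag name defensively
    $tag_ini = "<{$name}\\b[^>]*(?<!\\/)>";    // \b rejects "namespaces"; (?<!\/) rejects self-closing tags
    $tag_end = "<\\/{$name}>";
    preg_match_all('/' . $tag_ini . '(.*?)' . $tag_end . '/si', $xml, $matches);
    return $matches[1];
}

// Sample assembled from the two fragments quoted in the question.
$xml = <<<XML
<namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
</namespaces>
XML;

$contents = get_tag_contents($xml, 'namespace');
print_r($contents);
```

With this sample the outer namespaces tag and the self-closing tag are both skipped, so only the text of the two paired tags is returned.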

This line is what I needed

   $tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
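One thing worth noting about this variant: inside a character class, | and ^ (when not first) are literal characters, so [^>|^\/>] simply means "any character except >, |, ^ or /". That rejects self-closing tags, but it also rejects any opening tag that has one of those characters in an attribute. A small sketch of the difference, using a made-up url attribute as the test input:

```php
<?php
$tag = 'namespace';
// Hypothetical input: an attribute value that contains a slash.
$xml = '<namespace url="a/b">Talk</namespace>';

// Character-class variant from this post, and the lookbehind variant from the other answer.
$class_regex      = "/<{$tag}\\b[^>|^\\/>]*>(.*?)<\\/{$tag}>/si";
$lookbehind_regex = "/<{$tag}\\b[^>]*(?<!\\/)>(.*?)<\\/{$tag}>/si";

$class_hits      = preg_match($class_regex, $xml, $a);       // the "/" in the attribute blocks the match
$lookbehind_hits = preg_match($lookbehind_regex, $xml, $b);  // only a "/" directly before ">" is rejected

var_dump($class_hits, $lookbehind_hits);
```

If the data never has slashes, pipes, or carets inside the opening tags, both variants should behave the same.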

Thank you very much, @Alison and @Wictor, for your help and directions.
