Well I know there several questions similar but could not find any with this specific case.
I took one code and tweak it to my needs but now I'm founding a bug on it that I can't correct.
Code:
$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match);
static function get( $xml, $tag) { // http://stackoverflow.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
// bug case string(56) "<namespaces>
// <namespace key="-2">Media</namespace>"
$tag_ini = "<{$tag}[^\>]*?>"; $tag_end = "<\\/{$tag}>";
$tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';
preg_match_all($tag_regex,
$xml,
$matches,
PREG_OFFSET_CAPTURE);
return $matches;
}
As you can see, there is a bug if the tag is nested:
<namespaces> <namespace key="-2">Media</namespace>
When it should return 'Media', or even the outer '<namespaces>'
and then the inside ones.
I tried to add " <{$tag}[^\\>|^\\r\\n ]*?>
", ^\\s+
, changing the * to *?, and other few things that in best case turned to recognize only the bugged case.
Also tried "<{$tag}[^{$tag}]*?>"
which gives blank, I suppose it nullifies itself.
I'm a newb on regex, I can tell that to fix this just is needed to add don't let open a new tag of the same type. Or I could even use a hack answer for my use case, that excludes if the inside text has new line carriage.
Can anyone get the right syntax for this?
You can check an extract of the text here: http://pastebin.com/f2naN2S3
After the proposed change: $tag_ini = "<{$tag}\\\\b[^>]*>"; $tag_end = "<\\\\/{$tag}>";
$tag_ini = "<{$tag}\\\\b[^>]*>"; $tag_end = "<\\\\/{$tag}>";
it does work for the the example case, but not for this one:
<namespace key="0" />
<namespace key="1">Talk</namespace>
As it results in:
<namespace key="1">Talk"
It's because numbers and " and letters are considered inside word boundary. How could I address that?
This is probably not the idea answer, but I was messing with a regex generator:
<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11
$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';
$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))'; # Word 1
if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
$word1=$matches[1][0];
print "($word1) \n";
}
#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>
The main problem is that you did not use a word boundary after the opening tag and thus, namespace
in the pattern could also match namespaces
tag, and many others.
The subsequent issue is that the <${tag}\\b[^>]*>(.*?)<\\/${tag}>
pattern would overfire if there is a self-closing namespace
tag followed with a "normal" paired open/close namespace
tag. So, you need to either use a negative lookbehind (?<!\\/)
before the >
(see demo ), or use a (?![^>]*\\/>)
negative lookahead after \\b
(see demo ).
So, you can use
$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
This line is what I needed
$tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
Thank you very much you @Alison and @Wictor for your help and directions
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.