
In PHP, how can I use a regular expression to capture everything between two patterns (using the shortest instance of each pattern)?

I must be overcomplicating this, but I can't figure it out for the life of me.

I have a standard HTML document stored as a string, and I need to get the contents of each paragraph. Here is an example case.

$stringHTML=
"<html>

<head>
<title>Title</title>
</head>

<body>

<p>This is the first paragraph</p>
<p>This is the second</p>
<p>This is the third</p>
<p>And fourth</p>

</body>
</html>";

If I use

$regex='~(<p>)(.*)(</p>)~i';
preg_match_all($regex, $stringHTML, $newVariable); 

I won't get 4 results. Rather, I'll get 10. I get 10 because the regex matches the first <p> and first </p> as well as the first <p> and fourth </p>.

How can I search between two words and return only what's inside each paragraph?

Use an HTML parser such as DOM or XPath to parse HTML. Don't use regex to parse HTML. Here is how it can easily be parsed with DOMDocument.

$doc = new \DOMDocument;
$doc->loadHTML($stringHTML);
$ps = $doc->getElementsByTagName("p");
for($i=0;$i<$ps->length; $i++){
    echo $ps->item($i)->textContent. "\n";
}
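
The answer also mentions XPath. A minimal sketch of that route follows, assuming PHP's bundled DOM extension is available; the shortened input string and the $paragraphs variable are hypothetical names used only for illustration.

```php
<?php
// Hypothetical sample input, shortened from the question's document.
$stringHTML = "<html><body>\n<p>This is the first paragraph</p>\n<p>This is the second</p>\n</body></html>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($stringHTML);

$xpath = new DOMXPath($doc);
$paragraphs = [];
// '//p' selects every <p> element anywhere in the document.
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = $p->textContent;
}

print_r($paragraphs);
```

The XPath expression can be narrowed (for example to '//body/p') if only some paragraphs are wanted.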



Using this regex (since you said this is regex practice), you'll get 4 results.

preg_match_all("#<p>(.*)</p>#", $stringHTML, $matches);
print_r($matches[1]);

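This uses a single capture group. Because . does not match newlines by default, the greedy .* cannot run past the end of a line, so this pattern works only while each paragraph sits on its own line, as in the example.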

Use .*? to get the shortest match instead of the longest match.
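
To illustrate the difference, here is a small sketch using a hypothetical one-line input ($html, $greedy and $lazy are names made up for this example). With no newlines to stop it, the greedy pattern swallows everything up to the last </p>, while the lazy one stops at the nearest.

```php
<?php
// Hypothetical one-line input: without newlines, a greedy .* can run across paragraphs.
$html = '<p>first</p><p>second</p><p>third</p>';

// Greedy: .* extends to the LAST </p>, giving one oversized match.
preg_match_all('~<p>(.*)</p>~i', $html, $greedy);

// Lazy: .*? stops at the NEAREST </p>, giving one match per paragraph.
preg_match_all('~<p>(.*?)</p>~i', $html, $lazy);

print_r($greedy[1]); // one element: first</p><p>second</p><p>third
print_r($lazy[1]);   // three elements: first, second, third
```

If a paragraph can span several lines, adding the s modifier (so that . also matches newlines) alongside the lazy quantifier is one option.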

Your regex should be /<p>(.*?)<\/p>/i. It matches only the text between <p> and </p> and puts each match in an array.

You shouldn't put the tags in capture groups such as (<p>); capture only the content.
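
As a rough sketch of why the extra groups matter (the input and variable names here are hypothetical), each capture group adds another sub-array to the result of preg_match_all, so grouping the tags makes the output larger without adding information.

```php
<?php
// Hypothetical input used only to compare the shape of the results.
$html = '<p>first</p><p>second</p>';

// Three groups: index 0 is the full match, 1 is '<p>', 2 is the content, 3 is '</p>'.
preg_match_all('~(<p>)(.*?)(</p>)~i', $html, $threeGroups);

// One group: index 0 is the full match, 1 is the content.
preg_match_all('~<p>(.*?)</p>~i', $html, $oneGroup);

echo count($threeGroups), "\n"; // 4 sub-arrays
echo count($oneGroup), "\n";    // 2 sub-arrays
print_r($oneGroup[1]);          // the paragraph contents
```

With the single-group pattern, the paragraph text is always in $oneGroup[1].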

