Match multiple occurrences in a single regex

I have a string that looks like

$html = <<<EOT
<p><b>There are currently five entries in the London Borough of Barking &amp; Dagenham (LBBD):</b></p>
<p>My string 1<br>
My another string<br>
And this is also my string<br></p>
<p><i>Some text over here</i></p>
EOT;

I am trying to extract "My string 1", "My another string" and "And this is also my string" using PHP's preg_match(). What I have so far is:

preg_match("/There are currently .+ entries in .+:<\/b><\/p>\n<p>(.+<br>)\n+/", $html, $matches);
print_r($matches);

But it returns only the full match and the first occurrence. Is there a way to return an array of all the matches in a string? Thanks.

Use preg_match_all(). Unlike most languages, PHP has no g modifier for global matches (or replacements). Instead, you choose between preg_match() (first match only) and preg_match_all() (every match), and preg_replace() is global unless you pass a $limit.
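
As a rough illustration of the difference, here is a minimal sketch on a made-up subject string ($subject and the variable names are placeholders, not taken from the question):

```php
<?php
$subject = "a1 b2 c3";

// preg_match() stops after the first match.
preg_match('/\d/', $subject, $first);
// $first is ['1']

// preg_match_all() keeps going, which is what a "g" flag does elsewhere.
$count = preg_match_all('/\d/', $subject, $all);
// $count is 3 and $all[0] is ['1', '2', '3']

// preg_replace() replaces every match by default;
// the optional fourth argument ($limit) caps the number of replacements.
$everything = preg_replace('/\d/', '#', $subject);    // "a# b# c#"
$onlyFirst  = preg_replace('/\d/', '#', $subject, 1); // "a# b2 c3"
```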


By default, preg_match_all() orders the $matches array according to the flag PREG_PATTERN_ORDER. In other words, $matches[0] is an array of all the full matches, and $matches[1] is an array of everything captured by group 1. This means that count($matches) !== $number_of_matches. If you want $matches[0] to hold the first match together with its capture groups, use the flag PREG_SET_ORDER:

// With PREG_SET_ORDER, each $matches[$i] is one match: [full match, group 1]
preg_match_all(
    "/There are currently .+ entries in .+:<\/b><\/p>\n<p>(.+<br>)\n+/",
    $html,
    $matches,
    PREG_SET_ORDER
);
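
To make the two orderings concrete, here is a small sketch on a made-up key=value string (the subject, the pattern, and the variable names are illustrative assumptions only):

```php
<?php
$subject = "k1=v1;k2=v2";
$pattern = '/(\w+)=(\w+)/';

// Default ordering (PREG_PATTERN_ORDER): one sub-array per group.
preg_match_all($pattern, $subject, $byPattern);
// $byPattern[0] is ['k1=v1', 'k2=v2']  (full matches)
// $byPattern[1] is ['k1', 'k2']        (group 1)
// $byPattern[2] is ['v1', 'v2']        (group 2)
// count($byPattern) is 3, although there are only 2 matches.

// PREG_SET_ORDER: one sub-array per match.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['k1=v1', 'k1', 'v1']
// $bySet[1] is ['k2=v2', 'k2', 'v2']
// count($bySet) is 2, the number of matches.
```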

"Is there a way to return an array of occurring matches in a string?" Yes, the function is preg_match_all() .

Now, assuming that you really only want the text and not any of the HTML elements, you can use this:

preg_match_all("/(<p>)?(.+)<br>/", $html, $matches);

Then look in $matches[2] for the array you want. All of the full matches are stored in $matches[0], the first group is stored in $matches[1] (it captures the optional <p> tag), and your content is captured in $matches[2] (the second group). Any further groups would follow the same pattern.

That being said, you should probably look into using a DOM parser for something like this, as regex is generally quite bad at parsing HTML.
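
For comparison, here is a hedged sketch of the DOM approach using PHP's bundled DOMDocument and DOMXPath classes. It assumes the DOM extension is enabled and that the wanted lines are always the direct text children of the second <p>; the XPath expression and the $strings variable are my own choices, not something from the question:

```php
<?php
$html = <<<EOT
<p><b>There are currently five entries in the London Borough of Barking &amp; Dagenham (LBBD):</b></p>
<p>My string 1<br>
My another string<br>
And this is also my string<br></p>
<p><i>Some text over here</i></p>
EOT;

// Collect parser warnings internally instead of printing them;
// loadHTML() can be noisy on fragments and sloppy markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Assumption: the target is the second <p>, and each wanted line
// is a text node sitting directly inside it (between the <br> tags).
$strings = [];
foreach ($xpath->query('//p[2]/text()') as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $strings[] = $text;
    }
}

print_r($strings);
```

Selecting the paragraph by position is fragile if the page layout changes; a class or id on the element, if one exists, would be a sturdier hook.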

You need two entry points. The first is the sentence "There are currently..." up to the opening <p> tag; the second starts at the end of the previous match, after the <br> tag and the \n newline.

The first result uses the first entry point; every later result uses the second one.

\G is the anchor that matches the position at the end of the previous match. This is useful here because preg_match_all() keeps retrying the pattern until it reaches the end of the string. But since \G initially points at the start of the string, we need to exclude that case by adding (?!\A) (not at the start of the string).

Instead of .+, I use [^<]+ so that the match cannot run past the next tag.

For readability I use verbose mode (the x modifier), which ignores whitespace in the pattern and allows comments. Where I need literal spaces, I put them between \Q and \E. All characters between \Q and \E are treated as literals (except the pattern delimiter), and spaces are preserved.

$pattern = <<<'EOD'
~                    # using this delimiter instead of / avoids escaping
                     # every slash

(?:
    # first entry point
    \QThere are currently \E
    [^<]+?
    \Q entries in \E
    [^<]+ </b> </p> \n <p>
  |
    # second entry point
    (?!\A)\G
    <br>\n
)
\K           # drops everything matched so far from the match result
[^<]+        # the string you want
~x
EOD;

// $html is the string from the question
if (preg_match_all($pattern, $html, $matches)) {
    var_dump($matches[0]);
}
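
To see in isolation why the (?!\A) guard matters, here is a stripped-down sketch of the same \G technique on a made-up string (the "ids:" prefix, the subject, and the variable names are hypothetical and only serve the illustration):

```php
<?php
// Goal: collect the comma-separated numbers that follow "ids:".
$subject = ",9 ids:1,2";

// Without the guard, \G also succeeds at the very start of the subject,
// so the stray ",9" is picked up as if it continued a previous match.
preg_match_all('~(?:ids:|\G,)\K\d+~', $subject, $unguarded);
// $unguarded[0] is ['9', '1', '2']

// With (?!\A), the \G branch can only continue a real previous match.
preg_match_all('~(?:ids:|(?!\A)\G,)\K\d+~', $subject, $guarded);
// $guarded[0] is ['1', '2']
```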
