
Match multiple occurrences in a single regex

I have a string that looks like this:

$html = <<<EOT
<p><b>There are currently five entries in the London Borough of Barking &amp; Dagenham (LBBD):</b></p>
<p>My string 1<br>
My another string<br>
And this is also my string<br></p>
<p><i>Some text over here</i></p>
EOT;

I am trying to extract "My string 1", "My another string" and "And this is also my string" using PHP's preg_match(). What I have so far is:

preg_match("/There are currently .+ entries in .+:<\/b><\/p>\n<p>(.+<br>)\n+/", $html, $matches);
print_r($matches);

But it returns only the full matched text and the first occurrence. Is there a way to return an array of all the matches in a string? Thanks.

Use preg_match_all(). PHP has no g modifier for global matching (or replacing), unlike many other languages. Instead, you choose between preg_match() and preg_match_all(), or pass a $limit to preg_replace() to make it non-global.
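As a minimal sketch of the difference, assuming a made-up subject string and pattern that are not from the question:

```php
<?php
// Hypothetical subject and pattern, chosen only for illustration.
$subject = "a1 b2 c3";

// preg_match() stops after the first match.
preg_match('/[a-z]\d/', $subject, $first);

// preg_match_all() keeps scanning and collects every match;
// it returns the number of full matches found.
$count = preg_match_all('/[a-z]\d/', $subject, $all);

// preg_replace() replaces every match by default;
// the fourth argument ($limit) caps the number of replacements.
$once = preg_replace('/\d/', '#', $subject, 1);

var_dump($first[0], $count, $all[0], $once);
```

Here $first holds a single match, $all[0] holds all of them, and $once has only its first digit replaced.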


By default, preg_match_all() orders the $matches array according to the flag PREG_PATTERN_ORDER. In other words, $matches[0] is an array of the full matches and $matches[1] is an array of what capture group 1 matched. This means that count($matches) !== $number_of_matches: it is the number of capture groups plus one. If you want $matches[0] to be an array holding the first match and its capture groups, use the flag PREG_SET_ORDER:

preg_match_all(
    "/There are currently .+ entries in .+:<\/b><\/p>\n<p>(.+<br>)\n+/",
    $html,
    $matches,
    PREG_SET_ORDER
);

"Is there a way to return an array of occurring matches in a string?" Yes, the function is preg_match_all().
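To make the two result layouts of preg_match_all() concrete, here is a small sketch. The subject string and pattern are invented for the example and are not taken from the question.

```php
<?php
// Hypothetical input: two key=value pairs, two capture groups.
$subject = "x=1, y=2";
$pattern = '/([a-z])=(\d)/';

// Default layout (PREG_PATTERN_ORDER): one sub-array per group.
// $byPattern[0] = full matches, [1] = group 1, [2] = group 2.
preg_match_all($pattern, $subject, $byPattern);

// PREG_SET_ORDER: one sub-array per match.
// $bySet[0] = first match with its groups, $bySet[1] = second match.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);

var_dump(count($byPattern)); // groups + 1
var_dump(count($bySet));     // number of matches
```

With PREG_SET_ORDER, count($bySet) is the number of matches, which is often the more convenient shape when looping over results.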

Now, assuming that you really only want the text and not any of the HTML elements, you can use this:

preg_match_all("/(<p>)?(.+)<br>/", $html, $matches);

Then, look in $matches[2] for your desired array. That's because all of the full matches are stored in $matches[0], the first group is stored in $matches[1] (it captures the optional <p> tag), and your content is captured in $matches[2] (the second group). If there were more groups, they would follow the same pattern.
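If the <p> tag itself is of no interest, one possible variation (an assumption on my part, not part of the answer above) is to make that group non-capturing, so the text ends up in $matches[1]:

```php
<?php
$html = <<<EOT
<p><b>There are currently five entries in the London Borough of Barking &amp; Dagenham (LBBD):</b></p>
<p>My string 1<br>
My another string<br>
And this is also my string<br></p>
<p><i>Some text over here</i></p>
EOT;

// (?:<p>)? still allows an optional leading <p>, but does not
// create a capture group, so the content becomes group 1.
preg_match_all('/(?:<p>)?(.+)<br>/', $html, $matches);

print_r($matches[1]);
```

This relies on "." not matching newlines by default, so each match stays on one line.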


That being said, you should probably look into using a DOM parser for something like this, as regex is generally a poor tool for parsing HTML.
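A rough sketch of the DOM route, using PHP's bundled DOMDocument class (this requires the dom extension). The rule "take the paragraph that contains <br> elements" is my own assumption about how to locate the right paragraph; adjust it to the real page structure.

```php
<?php
$html = <<<EOT
<p><b>There are currently five entries in the London Borough of Barking &amp; Dagenham (LBBD):</b></p>
<p>My string 1<br>
My another string<br>
And this is also my string<br></p>
<p><i>Some text over here</i></p>
EOT;

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$strings = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // Assumption: only paragraphs containing <br> hold the wanted lines.
    if ($p->getElementsByTagName('br')->length === 0) {
        continue;
    }
    // The wanted strings are the direct text-node children of that <p>.
    foreach ($p->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            $text = trim($node->nodeValue);
            if ($text !== '') {
                $strings[] = $text;
            }
        }
    }
}

print_r($strings);
```

Unlike the regex versions, this does not depend on where the newlines fall in the markup.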

You need two entry points. The first is the sentence "There are currently..." up to the opening <p> tag; the second starts at the end of the previous match, after the <br> tag and the \n newline.

The first result will use the first entry point; the following results will use the second entry point.

\G is the anchor that matches the position at the end of the previous match. This is useful here because preg_match_all() keeps retrying the pattern until the end of the string. But since \G also matches at the start of the string on the first attempt, we exclude that case by adding (?!\A) (not at the start of the string).
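The two-entry-point idea can be tried in isolation on a smaller input. The subject string and the "list:" prefix below are hypothetical, picked only to show how \G chains matches together:

```php
<?php
// Hypothetical input: only the numbers after "list:" are wanted.
$subject = "skip:1,2 list:3,4,5 tail:6";

// Branch 1: the entry point "list:".
// Branch 2: continue exactly where the previous match ended, past one comma.
// (?!\A) stops \G from succeeding at the very start of the string.
// \K drops the prefix from the reported match.
$pattern = '~(?:list:|(?!\A)\G,)\K\d+~';

preg_match_all($pattern, $subject, $m);

print_r($m[0]);
```

The numbers after "skip:" and "tail:" are ignored because neither branch can start there: the literal prefix is missing and \G is not at that position.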

Instead of .+, I use [^<]+ so that the match cannot run past the text and into the next tag.

To make the pattern more readable I use verbose mode (the x modifier), which ignores whitespace in the pattern and allows comments. Where I need literal spaces, I put them between \Q and \E. All characters between \Q and \E are treated as literals (except the pattern delimiter), and spaces are preserved.

$pattern = <<<'EOD'
~                    # using this delimiter instead of / avoids having to escape
                     # all the slashes

(?:
    # first entry point
    \QThere are currently \E
    [^<]+?
    \Q entries in \E
    [^<]+ </b> </p> \n <p>
  |
    # second entry point
    (?!\A)\G
    <br>\n
)
\K           # discards everything matched so far from the match result
[^<]+        # the string you want
~x
EOD;

if (preg_match_all($pattern, $html, $matches))
    var_dump($matches[0]);
