Regex to match header tags that are not inside a specific div

Question

So I have PHP code that outputs HTML that looks like this:

<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>

What I'm trying to do is preg_match_all the header tags. My regular expression (<h([1-6]{1})[^>]*)>.*<\/h\2> matches all of them correctly, but I don't want to grab the headers that are inside the div with the class "ignore". I was reading about negative lookaheads, but it gets tricky. Any help would be appreciated.

Desired output:

<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>

Note that "I'm one in here too" is omitted because it's wrapped in the div with class "ignore".

Answer 1

Don't mess around with regular expressions here - unleash the power of DOMDocument in combination with XPath queries:

<?php
$html = <<<EOT
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
EOT;

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$headers = $xpath->query("
    //div[not(contains(@class, 'ignore'))]
    /*[self::h2 or self::h4 or self::h5]");

foreach ($headers as $header) {
    echo $header->nodeValue . "\n";
}

?>

This will yield:

This is a header
This is one too
Here's one

Answer 2

With DOMDocument and DOMXPath:

$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$nodeList = $xp->query('
//*
[contains(";h1;h2;h3;h4;h5;h6;", concat(";", local-name(), ";"))]
[not(ancestor::div[
    contains(concat(" ", normalize-space(@class), " "), " ignore ")
    ])
]');

foreach ($nodeList as $node) {
    echo 'tag name: ', $node->nodeName, PHP_EOL,
         'html content: ', $dom->saveHTML($node), PHP_EOL,
         'text content: ', $node->textContent, PHP_EOL,
         PHP_EOL;
}

If you aren't comfortable with XPath, take a look at the zvon tutorial.
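
As a minimal sketch of why the ancestor test in the query above is useful, the following wraps the same XPath in a function and runs it on a slightly harder input. The function name headerTextsOutsideIgnore() and the sample markup are made up for illustration. The input assumes a header nested two levels below the ignored div and a div whose class merely contains the substring "ignore":

```php
<?php
// Hypothetical helper (not from the answer): returns the text of every
// h1-h6 element that has no ancestor <div> carrying the class token "ignore".
function headerTextsOutsideIgnore(string $html): array
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xp = new DOMXPath($dom);

    // Same query as in the answer, written on fewer lines.
    $query = '//*[contains(";h1;h2;h3;h4;h5;h6;", concat(";", local-name(), ";"))]'
           . '[not(ancestor::div[contains(concat(" ", normalize-space(@class), " "), " ignore ")])]';

    $texts = [];
    foreach ($xp->query($query) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

// Illustrative input, not the asker's HTML.
$html = <<<'HTML'
<div class="wrapper">
<h2>Kept</h2>
<div class="box ignore"><div class="inner"><h5>Skipped: nested deeper</h5></div></div>
<div class="ignored-not"><h3>Kept too</h3></div>
</div>
HTML;

print_r(headerTextsOutsideIgnore($html));
```

Padding the class attribute with spaces makes the test match "ignore" only as a whole class token, so "ignored-not" is not excluded, while ancestor::div catches the header however deeply it is nested.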

Answer 3

Since you specify that you want to do it with preg_match(), here is an example of a negative lookbehind (i.e. one that matches only occurrences NOT preceded by XYZ): https://regex101.com/r/FeAsuj/1

The lookbehind itself is (?<!<div class=\"ignore\">).

But in the test snippet, notice how:

the regex depends on the exact use of whitespace ...
... so a platform-dependent \r\n can break the regex
the lookbehind cannot have a variable length, e.g. \n? - see "Regular Expression Lookbehind doesn't work with quantifiers ('+' or '*')"
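
The whitespace problem listed above can be reproduced with a small sketch. The pattern here is an assumption made for illustration and is not necessarily the one behind the regex101 link. It hard-codes a single "\n" between the div and the header, which is the kind of fixed-length lookbehind that breaks on Windows line endings:

```php
<?php
// Hypothetical pattern: a header NOT directly preceded by the exact text
// '<div class="ignore">' followed by one "\n".
$pattern = '~(?<!<div class="ignore">\n)<h([1-6])[^>]*>.*?</h\1>~';

// The same markup with Unix and with Windows line endings.
$unix    = "<div class=\"ignore\">\n<h5>skip me</h5>\n</div>";
$windows = "<div class=\"ignore\">\r\n<h5>skip me</h5>\r\n</div>";

// With "\n" the lookbehind sees the div and rejects the header.
$unixCount = preg_match_all($pattern, $unix);

// With "\r\n" the text before the header no longer equals the
// lookbehind's fixed string, so the unwanted header is matched.
$windowsCount = preg_match_all($pattern, $windows);

var_dump($unixCount, $windowsCount); // int(0) and int(1)
```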

If you MUST continue to work with regexes, consider a 2-step approach:

step 1, use preg_replace() to eliminate all unwanted sections.
step 2, use your existing regex.
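
The two steps could be sketched roughly as follows. The function name matchHeadersAfterStripping() and the stripping pattern are assumptions for illustration. The stripping pattern only works while the ignored div contains no nested div, because it stops at the first closing </div>:

```php
<?php
// Hypothetical helper (not from the answer) implementing the 2-step idea.
function matchHeadersAfterStripping(string $html): array
{
    // Step 1: remove every <div class="ignore">...</div> block.
    // Assumption: no <div> is nested inside the ignored div, so the lazy
    // .*? ending at the first </div> finds the right closing tag.
    $stripped = preg_replace('~<div class="ignore">.*?</div>~s', '', $html);

    // Step 2: the regex from the question, unchanged.
    preg_match_all('/(<h([1-6]{1})[^>]*)>.*<\/h\2>/', $stripped, $matches);

    return $matches[0];
}

$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

print_r(matchHeadersAfterStripping($html));
```

On this input the result holds the three headers outside the ignored div. Nested divs or different attribute quoting would need a more careful step 1, which is where an HTML parser becomes the simpler tool.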

In general, I concur with the other posters: avoid regex and go with an HTML parser.

Answer 1: score 2, posted 2020-03-05 20:40:34
Answer 2: score 2, posted 2020-03-05 20:59:00
Answer 3: score -2, accepted, posted 2020-03-05 21:19:14
