
Regular Expression To Match Header Tags Not In Specific Div

So I have PHP code that puts out HTML that looks like this:

<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>

What I'm trying to do is run preg_match_all on the header tags. My regular expression (<h([1-6]{1})[^>]*)>.*<\/h\2> returns all of them appropriately, but I don't want to grab the headers that are in the div with the class "ignore". I was reading about negative lookaheads, but it gets tricky. Any help would be appreciated.

Desired output:

<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>

Note that "I'm one in here too" is omitted because it is wrapped in the div with class "ignore".

Don't mess around with regular expressions here. Unleash the power of DOMDocument in combination with XPath queries:

<?php
$html = <<<EOT
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
EOT;

// loadHTML() is an instance method; calling it statically fails in PHP 8.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$headers = $xpath->query("
    //div[not(contains(@class, 'ignore'))]
    /*[self::h2 or self::h4 or self::h5]");

foreach ($headers as $header) {
    echo $header->nodeValue . "\n";
}

?>

This will yield:

This is a header
This is one too
Here's one

With DOMDocument and DOMXPath:

$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$nodeList = $xp->query('
//*
[contains(";h1;h2;h3;h4;h5;h6;", concat(";", local-name(), ";"))]
[not(ancestor::div[
    contains(concat(" ", normalize-space(@class), " "), " ignore ")
    ])
]');

foreach ($nodeList as $node) {
    echo 'tag name: ', $node->nodeName, PHP_EOL,
         'html content: ', $dom->saveHTML($node), PHP_EOL,
         'text content: ', $node->textContent, PHP_EOL,
         PHP_EOL;
}


If you aren't comfortable with XPath, take a look at the zvon tutorial.

Since you specify you want to do it with preg_match(), here is an example of a negative lookbehind (i.e. it matches only those occurrences that are NOT preceded by XYZ): https://regex101.com/r/FeAsuj/1

The lookbehind itself is (?<!<div class=\"ignore\">).
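
As a rough illustration of what such a lookbehind does and does not cover, here is a minimal sketch. The one-line input string and the header pattern are hypothetical and are not taken from the regex101 snippet. The sketch assumes the ignored header directly follows the opening div tag with no whitespace in between, because a PCRE lookbehind must have a fixed length.

```php
<?php
// Hypothetical input: two headers inside the "ignore" div, all on one line.
$html = '<div class="wrapper"><h2>Keep me</h2>'
      . '<div class="ignore"><h5>Skip me</h5><h6>Not skipped</h6></div></div>';

// The lookbehind rejects a header only when the text immediately
// before it is exactly <div class="ignore">.
$pattern = '~(?<!<div class="ignore">)<h([1-6])[^>]*>.*?</h\1>~';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full header matches.
print_r($matches[0]);
```

Only the header that sits right after the opening tag is rejected. The second header inside the same div is still matched, which is why a lookbehind alone is unlikely to be enough for this task.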

But in the test snippet on regex101, notice how:

If you MUST continue to work with regexes, consider a 2-step approach:

  • step 1, use preg_replace() to eliminate all unwanted sections.
  • step 2, use your existing regex.
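
The two steps above could be sketched roughly as follows. This is only an illustration, under the assumption that the "ignore" div never contains a nested div (the lazy .*? stops at the first closing div tag). The variable names $cleaned and $headers are made up for the example.

```php
<?php
// Sample input taken from the question.
$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

// Step 1: strip every <div class="ignore">...</div> section.
// The s flag lets . match newlines; .*? stops at the first </div>,
// so this breaks if the ignored div contains another div.
$cleaned = preg_replace('~<div class="ignore">.*?</div>~s', '', $html);

// Step 2: run the regex from the question on what is left.
// Using ~ as the delimiter means the slash needs no escaping.
preg_match_all('~(<h([1-6]{1})[^>]*)>.*</h\2>~', $cleaned, $matches);

// $matches[0] holds the full header matches.
$headers = $matches[0];
print_r($headers);
```

With the sample input this leaves the two h2 headers and the h4 header, and drops the h5 inside the "ignore" div. It stays fragile for nested or differently formatted markup, which is the usual argument for a parser.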

In general, I would concur with the other posters: avoid regex and go with an HTML parser.
