Regular Expression To Match Header Tags Not In Specific Div
So I have PHP code that outputs HTML that looks like this:
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
What I'm trying to do is preg_match_all the header tags. My regular expression
(<h([1-6]{1})[^>]*)>.*<\/h\2>
returns all of them appropriately, but I don't want to grab the headers that are in the div with the class "ignore". I was reading about negative lookaheads, but it gets tricky. Any help would be appreciated.
Desired output:
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
Note that "I'm one in here too" is omitted because it's wrapped in the div with class "ignore".
Don't mess around with regular expressions here - unleash the power of DOMDocument in combination with XPath queries:
<?php
$html = <<<EOT
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
EOT;
// loadHTML() is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$headers = $xpath->query("
//div[not(contains(@class, 'ignore'))]
/*[self::h2 or self::h4 or self::h5]");
foreach ($headers as $header) {
echo $header->nodeValue . "\n";
}
?>
This will yield
This is a header
This is one too
Here's one
With DOMDocument and DOMXPath:
$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('
//*
[contains(";h1;h2;h3;h4;h5;h6;", concat(";", local-name(), ";"))]
[not(ancestor::div[
contains(concat(" ", normalize-space(@class), " "), " ignore ")
])
]');
foreach ($nodeList as $node) {
echo 'tag name: ', $node->nodeName, PHP_EOL,
'html content: ', $dom->saveHTML($node), PHP_EOL,
'text content: ', $node->textContent, PHP_EOL,
PHP_EOL;
}
If you aren't comfortable with XPath, take a look at the zvon tutorial.
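To see why the query above pads the class attribute with spaces rather than using a bare contains(@class, "ignore"), here is a minimal sketch. The markup, including the "ignored-not" class, is hypothetical and only serves to show the difference between a substring test and a whole-token test.

```php
<?php
// Hypothetical markup: "ignored-not" merely contains the substring "ignore".
$html = <<<'HTML'
<div class="wrapper">
<h2>Kept</h2>
<div class="box ignore"><h3>Skipped</h3></div>
<div class="ignored-not"><h4>Also kept</h4></div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Substring test: also excludes headers under class="ignored-not".
$loose = $xp->query('//*[self::h2 or self::h3 or self::h4]
    [not(ancestor::div[contains(@class, "ignore")])]');

// Token test: excludes only divs whose class list holds the exact token "ignore".
$strict = $xp->query('//*[self::h2 or self::h3 or self::h4]
    [not(ancestor::div[
        contains(concat(" ", normalize-space(@class), " "), " ignore ")
    ])]');

// Collect the text content of each matched node into a plain array.
$texts = function (DOMNodeList $nodes): array {
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->textContent;
    }
    return $out;
};

$looseTexts  = $texts($loose);
$strictTexts = $texts($strict);

print_r($looseTexts);
print_r($strictTexts);
```

With this input the substring version returns only "Kept", while the token version returns "Kept" and "Also kept", which is usually what you want when class names can share prefixes.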
Since you specify you want to do it with preg_match(), here is an example of a negative look-behind (i.e. it matches only those occurrences NOT preceded by XYZ): https://regex101.com/r/FeAsuj/1
The lookbehind itself is (?<!<div class=\"ignore\">).
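As a rough sketch of how such a lookbehind behaves inside preg_match_all: the pattern below is my own illustration, not the one behind the regex101 link, and the two input strings are made up. It shows that the lookbehind only inspects the characters immediately before the header tag.

```php
<?php
// Illustrative pattern (an assumption, not the linked regex): the negative
// lookbehind sits directly before the opening header tag. "~" is the
// delimiter so "/" needs no escaping; \1 ties the closing tag to the level.
$pattern = '~(?<!<div class="ignore">)<h([1-6])[^>]*>.*?</h\1>~s';

// The ignored div opens immediately before its header: the lookbehind applies.
$adjacent = '<h2>Keep me</h2><div class="ignore"><h5>Skip me</h5></div>';

// A newline sits between the div and its header: the lookbehind sees "\n"
// right before "<h5>", so the header is matched after all.
$separated = "<h2>Keep me</h2><div class=\"ignore\">\n<h5>Skip me</h5></div>";

preg_match_all($pattern, $adjacent, $m1);
preg_match_all($pattern, $separated, $m2);

$adjacentMatches  = $m1[0];
$separatedMatches = $m2[0];

print_r($adjacentMatches);
print_r($separatedMatches);
```

The first call returns only the h2, the second returns both headers. PCRE lookbehinds must have a fixed length, so they cannot absorb arbitrary whitespace or other markup between the div and the header.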
But in the test snippet, notice how:
If you MUST continue to work with regexes, consider a 2-step approach:
In general, I would concur with the other posters: avoid regex and go with an HTML parser.
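For completeness, here is one way a 2-step regex approach could look. The individual steps are my own assumption, since they are not spelled out above: first strip the ignored div, then match the headers in what is left. It only holds if the ignored div contains no nested div.

```php
<?php
$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

// Step 1 (assumed): drop every <div class="ignore">...</div> block.
// The lazy match stops at the first </div>, so a nested <div> inside the
// ignored block would break this.
$stripped = preg_replace('~<div class="ignore">.*?</div>~s', '', $html);

// Step 2 (assumed): collect the remaining headers; \1 makes the closing tag
// use the same level as the opening tag.
preg_match_all('~<h([1-6])[^>]*>.*?</h\1>~s', $stripped, $matches);

$headers = $matches[0];
print_r($headers);
```

On the sample markup this yields the two h2 elements and the h4, and leaves out the h5. Anything less regular than this (extra attributes on the div, nesting, comments) is where a parser becomes the safer choice.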