PHP and XPath: get all first-level HTML tags (all siblings)

Question

My function needs to get all the first-level HTML tags from a fragment of HTML code so that I can then work with each one.

This is my HTML document, summarized here:

<p>The breed was first...</p>
<p>Semencic credits his...</p>

<h1>Appearance</h1>
<p>The breed's distinctive...</p>
<p>It should be symmetrical...</p>

<figure id="attachment_6" style="width: 840px" class="wp-caption alignnone">
    <img class="size-large wp-image-6" src="...jpg" alt="boerboel appearance" width="840" height="746">
    <figcaption class="wp-caption-text">The dog appearance.</figcaption>
</figure>

<h1>Requirements</h1>
<p>Prospective owners....</p>
<p>These dogs....</p>

<h2>A Little Warning!</h2>
<p>If you are considering...</p>
<blockquote>
    <p>According to...</p>
    <p>Source: http://...</p>
</blockquote>
<p>Although more suitable...</p>

Now, I want my output to be:

p
p
h1
p
p
figure
h1
p
p
h2
p
blockquote
p

But right now, it is:

h1
p
h1
p
h2
p
blockquote
p

There are several things wrong:
- the 'figure' isn't showing
- the paragraph tags are listed only once even when there are several sibling paragraphs
- the first p's aren't found

$doc = new DOMDocument();
$doc->loadHTML( $this->post_content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );

$xpath = new DOMXpath( $doc );
$nodes = $xpath->query( "/*/*" );

foreach ( $nodes as $node ) {

    echo $node->nodeName;
    echo '<br>';

    $this->add_part(
        md5( $node->textContent ),
        $node->nodeName
    );
}

Answer 1

For the record: with your exact HTML sample, I get this result:

p / h1 / p / p / figure / h1 / p / p / h2 / p / blockquote / p

instead of this (as per your question):

    h1 / p /              h1 / p /     h2 / p / blockquote / p

3v4l.org demo

So I don't know whether this answer will resolve the issue in your real code.

HTML has some rules. You are trying to process markup that has no root element. Wrap your code in something like <body>:

$doc->loadHTML( "<body>$txt</body>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );

This way, I get your desired result:

p
p
h1
p
p
figure
h1
p
p
h2
p
blockquote
p

3v4l.org demo
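
The wrapping approach above can be sketched as a small standalone function. This is only an illustration: the name first_level_tags and the sample string are hypothetical and do not come from the original post, and the libxml error handling is an added assumption to keep warnings about HTML5 tags such as <figure> out of the output.

```php
<?php
// Hypothetical helper (not from the original post): returns the names of the
// top-level elements of an HTML fragment by wrapping it in a <body> root.
function first_level_tags(string $fragment): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them; the
    // parser may complain about HTML5 tags such as <figure>.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(
        "<body>$fragment</body>",
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $names = [];
    // /*/* selects the element children of the document element (<body>).
    foreach ($xpath->query('/*/*') as $node) {
        $names[] = $node->nodeName;
    }
    return $names;
}

// Illustrative input only; nested elements such as the inner <p> are skipped.
$sample = '<p>One</p><h1>Title</h1><blockquote><p>Quote</p></blockquote><p>Two</p>';
echo implode("\n", first_level_tags($sample)), "\n";
```

Because the fragment now has a single root, the /*/* expression from the question can stay unchanged.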

Answer 2

DOM (libxml) will reformat the input so that it has a single document element. If you remove the parser options (LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD), it will repair the HTML and add html and body elements. So if you want the element nodes inside body, you can use the expression //body/*

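// $html holds the HTML fragment from the question.
$document = new DOMDocument();
// No parser options: libxml repairs the markup and adds html and body.
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Select every element that is a direct child of body.
foreach ($xpath->evaluate('//body/*') as $node) {
  var_dump($node->nodeName);
}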

Output:

string(1) "p"
string(1) "p"
string(2) "h1"
string(1) "p"
string(1) "p"
string(6) "figure"
string(2) "h1"
string(1) "p"
string(1) "p"
string(2) "h2"
string(1) "p"
string(10) "blockquote"
string(1) "p"
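
To see the repair described in this answer, the sketch below loads a short fragment without parser options and inspects the resulting tree. The fragment is a made-up stand-in for the question's HTML, and the stricter path /html/body/* is used here in place of //body/* only to make the implied structure explicit.

```php
<?php
// Illustrative fragment (an assumption, not the original post content).
$html = '<p>First</p><h1>Heading</h1><p>Second</p>';

$document = new DOMDocument();
// No parser options: libxml adds the implied html and body elements.
$document->loadHTML($html);

// The document element is the implied <html> wrapper, not the first <p>.
echo $document->documentElement->nodeName, "\n";

// Collect the element children of the implied <body>.
$xpath = new DOMXPath($document);
$names = [];
foreach ($xpath->evaluate('/html/body/*') as $node) {
    $names[] = $node->nodeName;
}
echo implode(' / ', $names), "\n";
```

With this approach the input does not need to be wrapped by hand, since the parser supplies the root elements itself.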

Answer 1: 0, 2016-04-28 23:11:28
Answer 2: 0, accepted, 2016-04-29 09:23:31