PHP's DOMXPath is stripping out my tags inside the matched text

I asked a question yesterday, and at the time the answer was just what I needed, but while working with some live data I discovered that it wasn't quite doing what I expected. (Previous question: Parse HTML with PHP's HTML DOMDocument)

It gets the data from the HTML page, but it also strips out all the HTML tags inside the captured block of text, which isn't what I want. (I might want to take some of the tags out, but not all, and this can be done later.)

That's a common problem with DOM: you have to do a bit more work if you want to get the content of a tag together with the content of all its children, markup included.
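To make the problem concrete, here is a minimal sketch of what happens when only nodeValue is read. The markup is a reduced, hypothetical version of the HTML in the question, and the variable names are illustrative.

```php
<?php
// Hypothetical input, reduced from the question's markup.
$html = '<div class="text"><p>Capture this <strong>text</strong> <em>1</em></p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="text"]')->item(0);

// nodeValue concatenates the text of all descendant nodes,
// so the <p>, <strong> and <em> tags do not appear in the result.
$text = $tag->nodeValue;
var_dump($text);
```

The words survive, but the tags around them are lost, which matches the behaviour described in the question.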

Basically, you have to loop over the child nodes of the node you've matched with your XPath query, and serialise each of them.

There is a solution proposed in one of the user notes on the manual page of the DOMElement class; see this note (its URL is in the code comment below).


Integrating this solution into the code you already have should give you something like this for the declaration of the HTML string, with sub-tags:

$html = <<<HTML
<div class="main">
    <div class="text">
        <p>
            Capture this <strong>text</strong> <em>1</em>
        </p>
        <p>
            And some other <strong>text</strong>
        </p>
    </div>
</div>
HTML;


And, to extract the data from that HTML string, you can use something like this:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

    // see http://fr.php.net/manual/en/class.domelement.php#86803
    $children = $tag->childNodes;
    foreach ($children as $child) {
        // Import each child into a scratch document so it can be serialised on its own
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child, true));
        $innerHTML .= $tmp_doc->saveHTML();
    }

    var_dump(trim($innerHTML));
}

The only thing that has changed is the content of the foreach loop: instead of just using $tag->nodeValue, you have to iterate over the child nodes.


Which gives me the following output:

string '<p>
            Capture this <strong>text</strong> <em>1</em>
        </p>


<p>
            And some other <strong>text</strong>
        </p>' (length=150)

This is the full content of the <div> tag that was matched, and of all its children, including the tags.
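If some of those tags should be dropped afterwards, as the question mentions, one option is strip_tags() with a list of tags to keep. This is only a sketch: the sample string and the choice of allowed tags are assumptions, not something taken from the question.

```php
<?php
// Hypothetical sample of the captured markup.
$innerHTML = '<p>Capture this <strong>text</strong> <em>1</em></p>';

// Keep <p> and <strong>; every other tag is removed, while its text content stays.
$filtered = strip_tags($innerHTML, '<p><strong>');
var_dump($filtered);
```

Note that strip_tags() leaves the attributes of the allowed tags untouched, so it is a formatting aid rather than a sanitiser for untrusted HTML.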


Note: there are often interesting ideas and solutions in the user notes of the manual ;-)

Pascal MARTIN's answer is great, but I found it can be simplified:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

    $children = $tag->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $dom->saveHTML($child);
    }

    var_dump(trim($innerHTML));
}

This appears to produce the same result, but doesn't require creating a new DOMDocument object on every pass of the foreach loop. (Passing a node to saveHTML() requires PHP 5.3.6 or later.)

EDIT:

So, after further experimentation, you can actually reduce the above to this:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($dom->saveHTML($tag)));
}
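
One thing to keep in mind with this shortest form: saveHTML($tag) serialises the matched node itself, so the output includes the surrounding <div class="text"> wrapper, whereas the loop over childNodes returns only what is inside it. The sketch below shows both side by side; innerHtml() is a hypothetical helper name (not part of the DOM extension), the markup is a reduced example, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: serialises only the children of $node (its "inner HTML").
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>Some <strong>text</strong></p></div>');

$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="text"]')->item(0);

$outer = $dom->saveHTML($tag); // matched element plus its children
$inner = innerHtml($tag);      // children only, without the wrapper

var_dump($outer, $inner);
```

Which of the two is wanted depends on whether the wrapper element should be part of the captured block.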
