
PHP DOMDocument / XPath: Get HTML text and its surrounding tags

I am looking for this functionality:

Given this HTML page:

<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>

I want to get an array that contains only the DISTINCT text elements (no duplicates) and, for each of them, an array of the tags that surround it.

For the HTML above, the result would be an array that looks like this:

array => 
 "Hello," surrounded by => "h1" and "body"
 "world!" surrounded by => "b", "h1" and "body"

I already do this:

$res=$xpath->query("//body//*/text()");

which gives me the distinct text contents but omits the HTML tags.

When I just do this:

$res=$xpath->query("//body//*");

I get duplicate texts, one for each level of nesting. For example, "world!" shows up three times: once for "body", once for "h1" and once for "b". I can't seem to find out which texts are actually duplicates. Just checking for duplicate text is not sufficient: the duplicates are sometimes only substrings of earlier texts, and a website could contain genuinely repeated text, which would then be wrongly discarded.

How could I solve this issue?

Thank you very much!!

Thomas

You can iterate over the parentNodes of the DOMText nodes:

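// $html is assumed to hold the markup to parse, e.g. the page from the question.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = array();
// Select every text node below <body>, including whitespace-only ones.
foreach($xpath->query('/html/body//text()') as $i => $textNode) {
    $textNodes[$i] = array(
        'text' => $textNode->nodeValue,
        'parents' => array()
    );
    // Walk up the tree; the loop ends at the DOMDocument, which has no parentNode.
    for (
        $currentNode = $textNode->parentNode;
        $currentNode->parentNode;
        $currentNode = $currentNode->parentNode
    ) {
        // Innermost tag first, e.g. "b", "h1", "body", "html".
        $textNodes[$i]['parents'][] = $currentNode->nodeName;
    }
}
print_r($textNodes);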


Note that loadHTML adds implied elements that are missing from the input, e.g. a missing html element (and, depending on the input, head or body), which you have to take into account in your XPath expressions. Also note that any whitespace used for formatting is parsed as a DOMText node, so you will likely get more nodes than you expect. If you only want to query for non-empty DOMText nodes, use

/html/body//text()[normalize-space(.) != ""]

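Putting the two pieces together, here is a minimal sketch of the parent walk combined with the whitespace filter. The `$html` string and the `$result` variable are made up for illustration, and the loop stops at the document node by checking for `DOMElement` instead of testing `parentNode`; treat it as one possible variant rather than the only way to write it.

```php
<?php
// Hypothetical input: the markup from the question, including formatting whitespace.
$html = "<body>\n <h1>Hello,\n  <b>world!</b>\n </h1>\n</body>";

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = array();
// The predicate skips text nodes that contain only whitespace.
foreach ($xpath->query('/html/body//text()[normalize-space(.) != ""]') as $textNode) {
    $parents = array();
    // Climb through element ancestors; the DOMDocument above <html> is not a DOMElement.
    for ($node = $textNode->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
        $parents[] = $node->nodeName;
    }
    $result[] = array(
        'text'    => trim($textNode->nodeValue), // drop the indentation around the text
        'parents' => $parents,                   // innermost tag first
    );
}
print_r($result);
```

Each text node appears exactly once, so genuinely repeated text elsewhere on the page would still produce its own entry.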

In your sample code, $res=$xpath->query("//body//*/text()") returns a DOMNodeList of DOMText nodes. For each DOMText, you can access the containing element via the parentNode property.
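As a rough sketch of that idea, the loop below keeps the query from the question and reads `parentNode` for the directly containing element. It additionally evaluates the XPath `ancestor::*` axis relative to each text node to collect all surrounding tags; that part is an addition for illustration, as are the `$html` string and the `$tagsByText` variable.

```php
<?php
// Hypothetical input without formatting whitespace, to keep the output small.
$html = '<body><h1>Hello, <b>world!</b></h1></body>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$tagsByText = array();
foreach ($xpath->query('//body//*/text()') as $textNode) {
    $tags = array();
    // Second argument: the context node the relative expression is evaluated from.
    foreach ($xpath->query('ancestor::*', $textNode) as $ancestor) {
        $tags[] = $ancestor->nodeName;
    }
    // Indexed by position, not by text, so repeated texts are not merged.
    $tagsByText[] = array(
        'text'   => $textNode->nodeValue,
        'direct' => $textNode->parentNode->nodeName, // the containing element
        'tags'   => $tags,                           // every surrounding element
    );
}
print_r($tagsByText);
```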

