PHP DOMXPath problem

I'm trying to parse blocks of text that contain HTML tags, but I'm running into a problem.

    $html = '
    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>
    <div>
        <span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>
    </div>
    ';

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    $dom->loadHTML($html);

    function getMessages($element, $xpath)
    {
        $messages = array();

        $children = $element->childNodes;

        foreach ($children as $child)
        {
            if (strtolower($child->nodeName) == 'div')
            {
                // my functions
            }
            if ($child->nodeType == XML_TEXT_NODE)
            {
                // Note: this reads the inner HTML of the parent, not of $child
                $text = trim(DOMinnerHTML($element));
                $messages[] = array('type' => 'text', 'text' => $text);
            }
        }

        return $messages;
    }

    function DOMinnerHTML($element)
    {
        $innerHTML = null;
        $children = $element->childNodes;

        foreach ($children as $child)
        {
            // Serialize each child through a temporary document
            $tmp_dom = new DOMDocument();
            $tmp_dom->appendChild($tmp_dom->importNode($child, true));
            $innerHTML .= trim($tmp_dom->saveHTML());
        }

        return $innerHTML;
    }

    $xpath = new DOMXPath($dom);
    $messagesXpath = $xpath->query("//div");

    $messages = array();
    $i = 0;
    foreach ($messagesXpath as $message)
    {
        $messages[] = getMessages($message, $xpath);
        $i++;
        if ($i == 2)
        {
            // stop after the first two <div>s
            break;
        }
    }

    var_dump($messages);


This code returns the following array:

    array(2) {
      [0]=>
      array(3) {
        [0]=>
        array(2) {
          ["type"]=>
          string(4) "text"
          ["text"]=>
          string(32) "Message<b>bold</b>,<s>strike</s>"
        }
        [1]=>
        array(2) {
          ["type"]=>
          string(4) "text"
          ["text"]=>
          string(32) "Message<b>bold</b>,<s>strike</s>"
        }
        [2]=>
        array(2) {
          ["type"]=>
          string(4) "text"
          ["text"]=>
          string(32) "Message<b>bold</b>,<s>strike</s>"
        }
      }
      [1]=>
      array(2) {
        [0]=>
        array(2) {
          ["type"]=>
          string(4) "text"
          ["text"]=>
          string(100) "<span class="how">
                <a href="link" title="text">Link</a>, <b> BOLD </b>
            </span>"
        }
        [1]=>
        array(2) {
          ["type"]=>
          string(4) "text"
          ["text"]=>
          string(100) "<span class="how">
                <a href="link" title="text">Link</a>, <b> BOLD </b>
            </span>"
        }
      }
    }

I want each message's 'text' entry to keep its HTML tags. That part works, but for some reason the entries in the array are repeated.

I think the problem is in this block:

    if ($child->nodeType == XML_TEXT_NODE)
    {
        $text = trim(DOMinnerHTML($element));
        $messages[] = array('type' => 'text', 'text' => $text);
    }

I think you are misunderstanding which elements are being iterated. You select all the <div>s and pass each one to getMessages. Inside getMessages, however, you then iterate over the childNodes of that <div> and, for every child that is an XML_TEXT_NODE, store the inner HTML of the whole <div>. That is where the duplication comes from: you get one copy per text node.

Let's take the HTML:

    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>

DOM elements and text nodes are logically different and have different node types, XML_ELEMENT_NODE and XML_TEXT_NODE (see the PHP manual for the full list of node type constants), so the <div> actually contains 5 children (TEXT, ELEMENT, TEXT, ELEMENT, TEXT). You were correct to identify the problematic if condition, but simply changing the type to *XML_ELEMENT_NODE* does not completely fix the problem: a <div> can still have several child nodes of type XML_ELEMENT_NODE (here <b> and <s>), and each of them would add another copy.
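To see this for yourself, you can dump the direct children of the first <div> together with their node types. This is only a minimal sketch: the `$typeNames` and `$seen` arrays are names made up for illustration, and the exact number of whitespace-only text nodes may vary with how libxml treats blanks.

```php
<?php
// Sketch: list the direct children of the first <div> and their node types.
$html = "<div>\n    Message <b>bold</b>, <s>strike</s>\n</div>";

$dom = new DOMDocument();
$dom->loadHTML($html);

$div = $dom->getElementsByTagName('div')->item(0);

// Illustrative lookup table for the two node types discussed above.
$typeNames = array(
    XML_ELEMENT_NODE => 'ELEMENT',
    XML_TEXT_NODE    => 'TEXT',
);

// Count how many children of each kind were seen.
$seen = array('ELEMENT' => 0, 'TEXT' => 0, 'OTHER' => 0);

foreach ($div->childNodes as $child) {
    $name = isset($typeNames[$child->nodeType]) ? $typeNames[$child->nodeType] : 'OTHER';
    $seen[$name]++;

    // json_encode makes leading/trailing whitespace in text nodes visible.
    echo $name, ': ', json_encode($child->textContent), "\n";
}
```

Every TEXT line in that listing corresponds to one duplicated entry in the question's output, because the original condition fires once per text node.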

To fully fix the problem, I changed what is passed to getMessages: it now receives the whole list of <div>s, so the function iterates at the correct level and the duplication disappears. I also removed some complexity and improved readability by renaming some variables.

Here is my complete solution:

    $html = <<<HTML
    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>
    <div>
        <span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>
    </div>
    HTML;

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    $dom->loadHTML($html);

    // Takes the list of <div> nodes and returns one inner-HTML string per <div>
    function getMessages($allDivs) {
        $messages = array();

        foreach ($allDivs as $div)  {
            if ($div->nodeType == XML_ELEMENT_NODE) {
                $messages[] = trim(DOMinnerHTML($div));
            }
        }

        return $messages;
    }

    // Concatenates the serialized HTML of every child of $element
    function DOMinnerHTML($element) {
        $innerHTML = null;
        $children = $element->childNodes;

        foreach ($children as $child) {
            $tmp_dom = new DOMDocument();
            $tmp_dom->appendChild($tmp_dom->importNode($child, true));
            $innerHTML .= trim($tmp_dom->saveHTML());
        }

        return $innerHTML;
    }

    $xpath = new DOMXPath($dom);
    $messagesXpath = $xpath->query("//div");

    $messages[] = getMessages($messagesXpath);
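
As a possible variation (not part of the solution above), the temporary DOMDocument can be avoided, because DOMDocument::saveHTML() accepts a node argument since PHP 5.3.6. The sketch below assumes that behaviour; `innerHtml` is a hypothetical helper name, and the HTML is written on one line to keep the example short.

```php
<?php
// Hypothetical helper: inner HTML of an element without a temporary document.
// Assumes DOMDocument::saveHTML($node) is available (PHP 5.3.6 or later).
function innerHtml(DOMNode $element)
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // Serialize each child through the document that owns it.
        $html .= $element->ownerDocument->saveHTML($child);
    }

    // Trim once at the end, so spaces between words and tags are kept.
    return trim($html);
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<div>Message <b>bold</b>, <s>strike</s></div>'
    . '<div><span class="how"><a href="link" title="text">Link</a>, <b> BOLD </b></span></div>'
);

$xpath = new DOMXPath($dom);

// One entry per <div>, so nothing is repeated.
$messages = array();
foreach ($xpath->query('//div') as $div) {
    $messages[] = innerHtml($div);
}

print_r($messages);
```

Unlike DOMinnerHTML, this version trims only the final string, so the space after "Message" and after the comma should survive in the result.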

