
PHP DOMXPath problem

I'm trying to parse blocks of text that contain HTML tags, but I have run into a problem.

<?php
    libxml_use_internal_errors(true);
    $html = '
<html>
<body>
    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>
    <div>
        <span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>
    </div>
</body>
</html>
    ';

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    $dom->loadHTML($html);        

    function getMessages($element, $xpath)
    {
        $messages = array();

        $children = $element->childNodes;        

        foreach ($children as $child) 
        { 

            if(strtolower($child->nodeName) == 'div')
            {
                // my functions
            }
            else
            if ($child->nodeType == XML_TEXT_NODE)
            {
                $text = trim(DOMinnerHTML($element));
                if($text)
                {
                    $messages[] = array('type' => 'text', 'text' => $text);
                }
            }
        }

        return $messages;
    }

    function DOMinnerHTML($element) 
    {
        $innerHTML = null; 
        $children = $element->childNodes;

        foreach ($children as $child) 
        {
            $tmp_dom = new DOMDocument(); 
            $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
            $innerHTML .= trim($tmp_dom->saveHTML()); 
        } 
        return $innerHTML; 
    } 

    $xpath = new DOMXPath($dom);
    $messagesXpath = $xpath->query("//div");

    $messages = array();
    $i = 0;
    foreach($messagesXpath as $message)
    {
        $messages[] = getMessages($message, $xpath);
        $i++;
        if ($i == 2)
        break;
    }

    var_dump($messages);  

This code returns the following array:

array(2) {
  [0]=>
  array(3) {
    [0]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(32) "Message<b>bold</b>,<s>strike</s>"
    }
    [1]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(32) "Message<b>bold</b>,<s>strike</s>"
    }
    [2]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(32) "Message<b>bold</b>,<s>strike</s>"
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(100) "<span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>

        </span>"
    }
    [1]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(100) "<span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>"
    }
  }
}

I want $messages['text'] to keep the HTML tags (that part is fine), but for some reason the array entries are repeated!

I think the problem is in this block:

if ($child->nodeType == XML_TEXT_NODE)
{
    $text = trim(DOMinnerHTML($element));
    if($text)
    {
          $messages[] = array('type' => 'text', 'text' => $text);
    }
}

I think you are misunderstanding which nodes are being iterated. You select all the <div>s and pass each one to getMessages. Inside getMessages, however, you iterate over the childNodes of that <div> and, for every child that is an XML_TEXT_NODE, append the inner HTML of the whole <div> ($element) rather than of the child. That is where the duplication comes from: you get one copy per text node.

Let's take the HTML:让我们以 HTML 为例:

<div>
    Message <b>bold</b>, <s>strike</s>
</div>

DOM elements and text nodes are logically different and have different types, XML_ELEMENT_NODE and XML_TEXT_NODE (see here for the full list), so the <div> actually contains 5 children (TEXT, ELEMENT, TEXT, ELEMENT, TEXT). You were correct to identify the problematic if condition, but simply changing the type to XML_ELEMENT_NODE does not completely fix the problem: each <div> can still have multiple childNodes of type XML_ELEMENT_NODE.
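
To see the mix of node types for yourself, a minimal sketch along these lines prints the type of every direct child of that <div>. The $types array and the 'OTHER' label are illustrative names, not part of the original code, and whether the trailing whitespace shows up as its own text node may depend on the parser's whitespace handling.

```php
<?php
// Sketch: list the node type of each direct child of the first <div>.
libxml_use_internal_errors(true);

$html = <<<HTML
<div>
    Message <b>bold</b>, <s>strike</s>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$div = $dom->getElementsByTagName('div')->item(0);

$types = [];
foreach ($div->childNodes as $child) {
    // Label the two node types discussed above; show any other type by number.
    if ($child->nodeType === XML_TEXT_NODE) {
        $types[] = 'TEXT';
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        $types[] = 'ELEMENT';
    } else {
        $types[] = 'OTHER(' . $child->nodeType . ')';
    }
}

echo implode(', ', $types), PHP_EOL;
```

Each TEXT entry in that list corresponds to one pass through the text-node branch in the question's loop, and therefore to one repeated copy of the same inner HTML.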

To fully fix the problem, I changed what is passed to the getMessages function so that it iterates at the correct level, which eliminates the duplication. I also removed some complexity and improved readability by renaming some variables.

Here is my complete solution:

<?php
    libxml_use_internal_errors(true);
    $html = <<<HTML
<html>
<body>
    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>
    <div>
        <span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>
    </div>
</body>
</html>
HTML;

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    $dom->loadHTML($html);

    function getMessages($allDivs) {
        $messages = array();

        foreach ($allDivs as $div)  {
            if ($div->nodeType == XML_ELEMENT_NODE) {
                $messages[] = trim(DOMinnerHTML($div));
            }
        }

        return $messages;
    }

    function DOMinnerHTML($element) {
        $innerHTML = null;
        $children = $element->childNodes;

        foreach ($children as $child) {
            $tmp_dom = new DOMDocument();
            $tmp_dom->appendChild($tmp_dom->importNode($child, true));
            $innerHTML .= trim($tmp_dom->saveHTML());
        }
        return $innerHTML;
    }

    $xpath = new DOMXPath($dom);
    $messagesXpath = $xpath->query("//div");

    $messages[] = getMessages($messagesXpath);

    print_r($messages);
?>
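
As a possible variation (a sketch, not part of the solution above), the inner HTML can also be built by passing each child node to DOMDocument::saveHTML(), which accepts an optional node argument since PHP 5.3.6, instead of importing it into a temporary document. The helper name innerHtml is hypothetical, the HTML is the question's markup collapsed onto one line, and the array shape with 'type' and 'text' keys is taken from the question.

```php
<?php
// Sketch: one entry per <div>, serializing children with saveHTML($node).
libxml_use_internal_errors(true);

// Hypothetical helper: concatenate the serialized form of each child node.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // With a node argument, saveHTML() serializes only that subtree.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body>'
    . '<div>Message <b>bold</b>, <s>strike</s></div>'
    . '<div><span class="how"><a href="link" title="text">Link</a>, <b> BOLD </b></span></div>'
    . '</body></html>'
);

$xpath = new DOMXPath($dom);

$messages = [];
foreach ($xpath->query('//div') as $div) {
    // Iterate over the <div> elements themselves, not over their children.
    $messages[] = ['type' => 'text', 'text' => innerHtml($div)];
}

print_r($messages);
```

Because the loop runs once per <div> rather than once per child node, each message appears only once regardless of how many text nodes the <div> contains.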
