PHP: How to get the correct closing tag of an HTML element

Question

Suppose I have an HTML page like this:

<!-- This is the opening tag -->
<div class="content_text">
  <div>Title</div>
  <div>Author Name</div>
  <div>Some complicated HTML elements correctly validated</div>
  <b>Some more text</b>
  <img ... />
  <div> more and more text </div>
</div><!-- This is the correct closing tag -->

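How can I get the content between the opening tag of the div with class="content_text" and its correct closing tag?

I tried regular expressions, but could not find any way to do it, simple or complicated.

I also tried XPath, but still could not get the content. Instead, I got the text of the outer div.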

Answer 1

You can use PHP Simple HTML DOM Parser to parse the HTML, much as DOMDocument is used for XML.

Note: PHP also supports DOMDocument natively.

Answer 2

    $scrape_address = "http://www.al-madina.com/node/444862";
    $ch = curl_init($scrape_address);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_ENCODING, "");         // accept any encoding cURL supports
    $data = curl_exec($ch);
    if ($data === false) {
      die("Request failed: " . curl_error($ch));
    }
    // I couldn't get an element by attribute, so I just replaced the class with an id
    $data = str_replace('class="content_text"', 'id="my_unique_id"', $data);

    $domd = new DOMDocument();
    libxml_use_internal_errors(true);
    $domd->loadHTML($data);
    libxml_use_internal_errors(false);
    $div = $domd->getElementById("my_unique_id");

    if ($div) {
      $dom2 = new DOMDocument();
      $dom2->appendChild($dom2->importNode($div, true));
      echo $dom2->saveHTML();
    } else {
      echo "Nothing found";
    }
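
As an alternative to rewriting the class into an id, the element can probably be selected by its class directly with DOMXPath. The following is a minimal sketch, not part of the original answer: the `$data` string is made-up sample markup standing in for the downloaded page, and the `concat`/`normalize-space` test is one common way to match a single class token.

```php
<?php
// Sample markup standing in for the downloaded page (an assumption for illustration).
$data = <<<HTML
<html><body>
<div class="content_text">
  <div>Title</div>
  <b>Some more text</b>
</div>
<div class="footer">Footer</div>
</body></html>
HTML;

$domd = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$domd->loadHTML($data);
libxml_clear_errors();

// Match any <div> whose class attribute contains the token "content_text".
$xpath = new DOMXPath($domd);
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " content_text ")]';
$div = $xpath->query($query)->item(0);

// saveHTML($node) serializes only that node, from its opening tag
// through its matching closing tag.
$outer = $div ? $domd->saveHTML($div) : '';
echo $outer === '' ? "Nothing found" : $outer;
```

Because the parser builds a tree, the "correct" closing tag never has to be located by hand: the nested divs are simply children of the selected node.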

Answer 3

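I recommend using PHP's DOMDocument. Regular expressions will not work unless the structure of the content is always exactly the same, and even then the result won't be pretty.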
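
To illustrate why a regular expression struggles here, the sketch below runs two naive patterns against a made-up snippet with a nested div. The sample string and both patterns are assumptions chosen for illustration, not taken from the question.

```php
<?php
// Hypothetical input: the target div contains a nested div and has a sibling div.
$html = '<div class="content_text"><div>Title</div><b>Some more text</b></div><div>Other</div>';

// Non-greedy: stops at the first </div>, which closes the inner "Title" div.
preg_match('~<div class="content_text">(.*?)</div>~s', $html, $lazy);

// Greedy: runs to the last </div> in the input, swallowing the unrelated sibling.
preg_match('~<div class="content_text">(.*)</div>~s', $html, $greedy);

echo $lazy[1], "\n";   // <div>Title
echo $greedy[1], "\n"; // <div>Title</div><b>Some more text</b></div><div>Other
```

Neither pattern knows which `</div>` pairs with which opening tag, which is exactly the bookkeeping a DOM parser does.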

Also, here is a question about a similar situation that was solved with SimpleXML; it may help.

Answer 4

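You seem to be able to run XPath queries successfully already, so I'll skip the PHP code and go straight to the XPath part.

I'm not sure what you mean by "content", so here are a few alternatives:

If you want all the text nodes inside the <div/>: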

//div[@class="content_text"]//text()

If you want all of the markup, including the element itself:

//div[@class="content_text"]

Both return a set of results, so make sure you loop over them.
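
For readers who want the PHP side as well, the two queries above might be run roughly as follows. This is only a sketch: the sample `$html` string is an assumption, and concatenating the serialized children is one way (not the only one) to get the markup between the opening and closing tags.

```php
<?php
// Hypothetical sample input; the real page would come from a file or HTTP request.
$html = '<div class="content_text"><div>Title</div><b>Some more text</b></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// 1) All text nodes inside the div, collected in document order.
$texts = [];
foreach ($xpath->query('//div[@class="content_text"]//text()') as $textNode) {
    $texts[] = $textNode->nodeValue;
}

// 2) The element itself; serializing each child yields the inner markup,
//    i.e. everything between the opening tag and its matching closing tag.
$inner = '';
foreach ($xpath->query('//div[@class="content_text"]') as $div) {
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

print_r($texts);
echo $inner, "\n";
```

Note that `@class="content_text"` matches only when the class attribute is exactly that string; an element with several classes would need a token-based test instead.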


Answer 1: 5, 2013-04-09 22:22:06
Answer 2: 4, accepted, 2013-04-09 22:22:33
Answer 3: 2, 2013-04-09 22:22:53
Answer 4: 0, 2013-04-09 22:52:12
