Extracting DOM elements from HTML with PHP Simple HTML DOM Parser

Question

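I am trying to use PHP Simple HTML DOM Parser to extract the links to the articles, including their text, from this site.

I want to extract all the h2 tags of the articles on the home page, and I tried to do it this way: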

    $html = file_get_html('http://www.winbeta.org');
    $articles = $html->getElementsByTagName('article');
    $a = null;

    foreach ($articles->find('h2') as $header) {
        $a[] = $header;
    }

    print_r($a);

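According to the manual, this should first get everything inside the article tags, then extract the h2 of each article and store it in the array. But instead, it gives me: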


Answer 1

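There are a few problems:

- getElementsByTagName apparently returns a single node rather than an array, so it will not work if the page has more than one article tag. Use find instead, which does return an array;
- However, once you make that switch, you cannot call find on the result of find (an array has no find method), so you should either call it on each individual matched article tag or, better, pass a combined selector as the argument to find;
- The main issue: you must explicitly retrieve a node's text content with ->plaintext, otherwise you get the object representation of the node with all its attributes and internals;
- Some of the text contains HTML entities, such as the one for ’. These can be decoded with html_entity_decode.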
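The entity decoding mentioned in the last point can be tried in isolation. This is a minimal sketch, assuming a made-up headline string (`$raw` is hypothetical; in the real code the string would come from `->plaintext`) and assuming the page is UTF-8:

```php
<?php
// Hypothetical sample headline containing a numeric and a named entity.
$raw = 'Windows 10&#8217;s new build &amp; more';

// Decode both named and numeric entities into UTF-8 characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

Passing the flags and charset explicitly is optional, but it avoids depending on the defaults of the PHP version in use.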

So this code should work:

    $a = array();
    foreach ($html->find('article h2') as $h2) { // any h2 within an article
        // plaintext returns only the text; decode any HTML entities in it
        $a[] = html_entity_decode($h2->plaintext);
    }

With array_map, you could also do it like this:

    $a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); },
                   $html->find('article h2'));

If you also need to retrieve other tags within the articles and store their text in separate arrays, you could do the following:

    $a = array(); // h2 texts
    $b = array(); // h3 texts
    foreach ($html->find('article') as $article) {
        // find is called on each article node, not on the array of articles
        foreach ($article->find('h2') as $h2) {
            $a[] = html_entity_decode($h2->plaintext);
        }
        foreach ($article->find('h3') as $h3) {
            $b[] = html_entity_decode($h3->plaintext);
        }
    }
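The question also asks for the article links. As a rough sketch that is not part of the original answer, the same selection can be done with PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM Parser. The markup in `$sample` is made up for illustration and does not reflect the real page structure:

```php
<?php
// Hypothetical markup standing in for the real page.
$sample = <<<'HTML'
<html><body>
<article><h2><a href="/one">Tips &amp; tricks</a></h2><h3>Sub one</h3></article>
<article><h2><a href="/two">Second post</a></h2></article>
<div><h2><a href="/other">Not in an article</a></h2></div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings about HTML5 tags such as <article>
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Every link inside an h2 that is itself inside an article.
foreach ($xpath->query('//article//h2//a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent), // entities are already decoded here
    ];
}

print_r($links);
```

With this approach textContent already returns decoded text, so no separate html_entity_decode call is needed.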

Solution 1
4 Accepted 2016-01-05 20:32:07
