
Extracting DOM elements from HTML using PHP Simple HTML DOM Parser

I'm trying to extract links to the articles, including their text, from this site using PHP Simple HTML DOM Parser.


I want to extract all h2 tags for the articles on the main page, and I'm trying to do it this way:

    $html = file_get_html('http://www.winbeta.org');
    $articles = $html->getElementsByTagName('article');
    $a = null;

    foreach ($articles->find('h2') as $header) {
        $a[] = $header;
    }

    print_r($a);

According to the manual, it should first get all the content inside the article tags, then extract the h2 from each article and save it in an array. But instead it gives me:

[screenshot of the print_r output]

EDIT: [screenshot]

There are several problems:

  • getElementsByTagName apparently returns a single node, not an array, so it will not work when the page has more than one article tag. Use find instead, which does return an array;
  • Once you make that switch, you cannot call find on the result of find, because that result is an array rather than a node. Either call find on each matched article tag individually or, better, pass a combined selector to find;
  • Main issue: you must retrieve the text content of a node explicitly with ->plaintext; otherwise you get the object representation of the node, with all its attributes and internals;
  • Some of the text contains HTML entities, such as the encoded form of ’. These can be decoded with html_entity_decode.
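
To illustrate the last point, here is a minimal sketch of what html_entity_decode does to a headline. The sample string is made up, and passing the flags and charset explicitly is an assumption made here so that the result does not depend on the defaults of a particular PHP version:

```php
<?php
// Hypothetical headline text, as it might come back from ->plaintext.
$raw = 'Microsoft&rsquo;s &quot;next&quot; build &amp; more';

// ENT_QUOTES decodes both double- and single-quote entities;
// 'UTF-8' is the charset of the decoded string.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

The one-argument form used in this answer relies on the PHP defaults, which is usually enough for UTF-8 pages.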

So this code should work:

$a = array();
foreach ($html->find('article h2') as $h2) { // any h2 within article
    $a[] = html_entity_decode($h2->plaintext);
}

Using array_map, you could also do it like this:

$a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); }, 
               $html->find('article h2'));
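
As a quick check that the loop and the array_map form build the same array, the sketch below runs both on stand-in objects. The $nodes array is hypothetical: it only mimics the ->plaintext property of the nodes that find returns, so the snippet runs without the library:

```php
<?php
// Stand-ins for the nodes returned by find(); only ->plaintext is modelled.
$nodes = [
    (object) ['plaintext' => 'First &amp; foremost'],
    (object) ['plaintext' => 'Second headline'],
];

// Loop version, as in the first snippet.
$viaLoop = [];
foreach ($nodes as $h2) {
    $viaLoop[] = html_entity_decode($h2->plaintext);
}

// array_map version, as in the second snippet.
$viaMap = array_map(function ($h2) {
    return html_entity_decode($h2->plaintext);
}, $nodes);

// Both arrays hold the same decoded strings in the same order.
var_dump($viaLoop === $viaMap);
```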

If you need to retrieve other tags within articles as well, to store their texts in different arrays, then you could do as follows:

$a = array();
$b = array();
foreach ($html->find('article') as $article) {
    foreach ($article->find('h2') as $h2) {
        $a[] = html_entity_decode($h2->plaintext);
    }
    foreach ($article->find('h3') as $h3) {
        $b[] = html_entity_decode($h3->plaintext);
    }
}
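
If the list of tags grows, the nested loops above could be folded into one helper. The following is only a sketch: collectTexts and FakeNode are hypothetical names introduced here, and FakeNode is a stand-in that mimics just find and ->plaintext so the example runs without the library. With the real parser, the object returned by file_get_html would presumably be passed in its place.

```php
<?php
/**
 * Collect the decoded text of each requested tag inside every article.
 * $root only needs a find($selector) method that returns an array of nodes.
 */
function collectTexts($root, array $tags)
{
    // One empty list per tag, so tags that never match still appear in the result.
    $result = array_fill_keys($tags, []);
    foreach ($root->find('article') as $article) {
        foreach ($tags as $tag) {
            foreach ($article->find($tag) as $node) {
                $result[$tag][] = html_entity_decode($node->plaintext);
            }
        }
    }
    return $result;
}

// Hypothetical stand-in for a parser node; not part of Simple HTML DOM.
class FakeNode
{
    public $plaintext;
    private $children;

    public function __construct($plaintext = '', array $children = [])
    {
        $this->plaintext = $plaintext;
        $this->children = $children;
    }

    // Returns the children registered under the selector, or an empty array.
    public function find($selector)
    {
        return isset($this->children[$selector]) ? $this->children[$selector] : [];
    }
}

// A fake page with two articles; the second one has no h3.
$page = new FakeNode('', [
    'article' => [
        new FakeNode('', [
            'h2' => [new FakeNode('Title &amp; one')],
            'h3' => [new FakeNode('Subtitle one')],
        ]),
        new FakeNode('', [
            'h2' => [new FakeNode('Title two')],
        ]),
    ],
]);

$texts = collectTexts($page, ['h2', 'h3']);
print_r($texts);
```

The result is keyed by tag name, so $texts['h2'] and $texts['h3'] play the roles of $a and $b above.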
