Extracting DOM elements from HTML using PHP Simple HTML DOM Parser
I'm trying to extract the links to the articles, including their text, from this site using PHP Simple HTML DOM Parser.
I want to extract all h2 tags for the articles on the main page, and I'm trying to do it this way:
$html = file_get_html('http://www.winbeta.org');
$articles = $html->getElementsByTagName('article');
$a = null;
foreach ($articles->find('h2') as $header) {
$a[] = $header;
}
print_r($a);
According to the manual, it should first get all the content inside the article tags, then extract the h2 for each article and save it in an array. But instead it gives me:
There are several problems:
- getElementsByTagName apparently returns a single node, not an array, so it would not work if you have more than one article tag on the page. Instead use find, which does return an array;
- But once you make that switch, you cannot call find on the result of find, because that result is an array rather than a node. So you should call it on each individual matched article tag or, better, pass a combined selector as the argument to find;
- Main issue: you must retrieve the text content of the node explicitly with ->plaintext, otherwise you get the object representation of the node, with all its attributes and internals;
- Some of the text contains HTML entities, such as the one for the ’ character. These can be decoded with html_entity_decode.
So this code should work:
$a = array();
foreach ($html->find('article h2') as $h2) { // any h2 within article
    $a[] = html_entity_decode($h2->plaintext);
}
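To see the decoding step on its own, here is a small standalone sketch. The sample string is made up for illustration (it is not taken from the site), and the explicit flags and charset are my assumption about sensible settings rather than something the loop above requires.

```php
<?php
// Hypothetical headline text, as it might come back from ->plaintext:
// a numeric entity for the right single quote plus a named entity.
$raw = 'Microsoft&#8217;s new build &amp; more';

// ENT_QUOTES | ENT_HTML5 decodes both quote styles and HTML5 named entities;
// the third argument makes the output encoding explicit.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

Without the decoding call, the entities would be stored in the array literally, which is usually not what you want for display or comparison.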
Using array_map, you could also do it like this:
$a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); },
$html->find('article h2'));
If you need to retrieve other tags within the articles as well, and store their texts in different arrays, then you could do as follows:
$a = array();
$b = array();
foreach ($html->find('article') as $article) {
    foreach ($article->find('h2') as $h2) {
        $a[] = html_entity_decode($h2->plaintext);
    }
    foreach ($article->find('h3') as $h3) {
        $b[] = html_entity_decode($h3->plaintext);
    }
}
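The question also asks for the links themselves. As a rough sketch that runs without the library or network access, the same "h2 inside article" idea can be tried with PHP's built-in DOMDocument and DOMXPath on an inline sample. Note that this swaps in a different parser than Simple HTML DOM, and the sample markup (an a element inside each h2) is an assumption, not the real structure of the site.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$sample = <<<HTML
<html><body>
<article><h2><a href="/post-1">First &amp; foremost</a></h2></article>
<article><h2><a href="/post-2">Second post</a></h2></article>
</body></html>
HTML;

// Collect parser warnings internally: libxml complains about HTML5 tags
// such as <article>, but still builds the elements.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();
// XPath counterpart of the CSS selector "article h2 a".
foreach ($xpath->query('//article//h2//a') as $anchor) {
    $links[] = array(
        'href' => $anchor->getAttribute('href'),
        // textContent is already entity-decoded by the parser.
        'text' => trim($anchor->textContent),
    );
}

print_r($links);
```

With Simple HTML DOM itself, the corresponding approach would presumably be a find('article h2 a') call, reading ->href and ->plaintext from each match, again assuming the headings wrap their links that way.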