Crawling and working on HTML for an aggregation site

I am working on a crawling script in PHP, using the PHP Simple HTML DOM Parser.

After fetching the HTML, I need to extract only some of the information from each page and aggregate it into an HTML page on my own site.

I don't understand how to proceed with this.

Any help is appreciated.

Added:

I want to extract only some posts (those related to a particular geography and topic).

Regular expressions may be the way to get complex information out of the data, but for simple tags you can use something like:


// Load the Simple HTML DOM library (adjust the path to your install)
require_once 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
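For the filtering part of the question (keeping only posts about a particular geography and topic), a regular expression over the extracted text may be enough. The following is a minimal sketch rather than a definitive implementation: the helper name `matchesAll`, the keyword lists, and the sample posts are all hypothetical, and it assumes PHP 7.4 or later for the arrow functions.

```php
<?php
// Hypothetical helper: returns true when $text mentions at least one
// keyword from every group (e.g. one place name AND one topic word).
function matchesAll(string $text, array $keywordGroups): bool
{
    foreach ($keywordGroups as $keywords) {
        // Escape each keyword so it is matched literally.
        $escaped = array_map(fn($k) => preg_quote($k, '/'), $keywords);
        // \b matches whole words only; i = case-insensitive, u = UTF-8.
        $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/iu';
        if (preg_match($pattern, $text) !== 1) {
            return false;
        }
    }
    return true;
}

// Sample data, assumed for illustration only.
$posts = [
    'Flooding closes roads in Mumbai after heavy rain',
    'Mumbai film festival opens this week',
    'Heavy rain expected across Delhi',
];
$groups = [
    ['Mumbai', 'Pune'],     // geography
    ['rain', 'flooding'],   // topic
];

// Keep only the posts that satisfy both groups.
$selected = array_values(array_filter($posts, fn($p) => matchesAll($p, $groups)));
print_r($selected);
```

Plain keyword matching like this is crude; it will miss synonyms and misspellings, so treat it as a starting point for the filter rather than a finished classifier.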

You could do something like this:

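$doc = new DOMDocument();
// $url is the address of the page to crawl; @ suppresses the warnings
// that malformed real-world HTML would otherwise trigger
@$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("your-xpath-query");
foreach ($nodeList as $node) {
    // grab the content, attributes or whatever you're looking for
}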

With XPath queries you don't have to traverse the DOM tree manually, and your script is more robust against structural changes in the sites you crawl.
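To make the XPath approach concrete, here is a hedged sketch of the extract-then-aggregate step. It runs on an inline HTML string instead of a live URL, and the markup, the `post` class name, and the helpers `extractPosts` and `renderPosts` are assumptions made for illustration; a real site will need its own queries.

```php
<?php
// Sample markup standing in for a fetched page; the structure is assumed.
$pageHtml = <<<'MARKUP'
<html><body>
  <div class="post"><h2><a href="/p/1">Rain in Mumbai</a></h2><p>Roads flooded.</p></div>
  <div class="sidebar"><a href="/about">About</a></div>
  <div class="post"><h2><a href="/p/2">Delhi traffic</a></h2><p>Delays expected.</p></div>
</body></html>
MARKUP;

// Hypothetical helper: pull title, link and summary out of every "post" block.
function extractPosts(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $posts = [];
    // Matches divs whose class attribute is exactly "post".
    foreach ($xpath->query('//div[@class="post"]') as $node) {
        // A leading "." makes the query relative to the context node.
        $link = $xpath->query('.//h2/a', $node)->item(0);
        $summary = $xpath->query('.//p', $node)->item(0);
        if ($link === null) {
            continue; // skip blocks that lack the expected structure
        }
        $posts[] = [
            'title'   => trim($link->textContent),
            'url'     => $link->getAttribute('href'),
            'summary' => $summary ? trim($summary->textContent) : '',
        ];
    }
    return $posts;
}

// Hypothetical helper: render the extracted posts as a list for your own page.
function renderPosts(array $posts): string
{
    $items = '';
    foreach ($posts as $post) {
        $items .= sprintf(
            "<li><a href=\"%s\">%s</a> %s</li>\n",
            htmlspecialchars($post['url']),
            htmlspecialchars($post['title']),
            htmlspecialchars($post['summary'])
        );
    }
    return "<ul>\n" . $items . "</ul>\n";
}

$posts = extractPosts($pageHtml);
echo renderPosts($posts);
```

Escaping with `htmlspecialchars` matters here because the text comes from pages you don't control. The sketch also leaves the links relative (`/p/1`); on an aggregation page you would probably want to resolve them against the source site's base URL.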

I hope that gets you on the right track. For a more detailed example, you will have to provide more information.
