Crawling and using HTML on an aggregator website

Question

I am working on a crawling script in PHP, using PHP Simple HTML DOM Parser.

After getting the HTML, I need to extract only some of the information from each page and aggregate it into my own HTML page on my site.

I do not understand how to proceed with this.

Any help is appreciated.

Added

I want to extract some posts (only those related to a particular geography and topic).

Answer 1

Regular expressions may be the way to get complex information out of the data, but for simple tags you can use something like:

// Load the Simple HTML DOM library (adjust the path to where you saved it)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images and print their source URLs
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links and print their targets
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

Answer 2

You could do something like this:

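$url = 'http://www.example.com/'; // placeholder: the page you want to crawl
$doc = new DOMDocument();
// @ suppresses the warnings that malformed real-world HTML triggers
@$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("your-xpath-query");
foreach ($nodeList as $node) {
    // grab the content, attributes or whatever you're looking for
}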

With XPath queries you don't have to traverse the DOM tree manually, and your script is more robust against structural changes in the sites you crawl.
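To make that a bit more concrete, here is a minimal sketch that pulls posts out of a page with XPath, keeps only those matching a region and topic, and renders them as an HTML list for your own page. Since I don't know the structure of the sites you are crawling, the sample markup, the class names (post, title, region, topic) and the helper functions extractPosts and renderPosts are all made up for illustration; you would adjust the XPath expressions to the real markup.

```php
<?php
// Hypothetical sample markup standing in for a fetched page. The class names
// "post", "title", "region" and "topic" are assumptions for this example.
$html = <<<'HTML'
<html><body>
<div class="post">
<h2 class="title"><a href="/posts/1">Flood warning issued</a></h2>
<span class="region">Kerala</span>
<span class="topic">weather</span>
</div>
<div class="post">
<h2 class="title"><a href="/posts/2">Election results</a></h2>
<span class="region">Kerala</span>
<span class="topic">politics</span>
</div>
<div class="post">
<h2 class="title"><a href="/posts/3">Heat wave continues</a></h2>
<span class="region">Punjab</span>
<span class="topic">weather</span>
</div>
</body></html>
HTML;

// Return title/link pairs for the posts that match the given region and topic.
function extractPosts(string $html, string $region, string $topic): array
{
    // Collect parser errors internally so sloppy HTML does not emit warnings
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $posts = [];
    foreach ($xpath->query('//div[@class="post"]') as $post) {
        // The second argument makes each query relative to the current post
        $postRegion = trim($xpath->evaluate('string(.//span[@class="region"])', $post));
        $postTopic  = trim($xpath->evaluate('string(.//span[@class="topic"])', $post));
        if (strcasecmp($postRegion, $region) !== 0 || strcasecmp($postTopic, $topic) !== 0) {
            continue; // not the geography/topic we want
        }
        $posts[] = [
            'title' => trim($xpath->evaluate('string(.//h2[@class="title"]/a)', $post)),
            'link'  => $xpath->evaluate('string(.//h2[@class="title"]/a/@href)', $post),
        ];
    }
    return $posts;
}

// Build an HTML list from the extracted posts, escaping all scraped values.
function renderPosts(array $posts): string
{
    $items = '';
    foreach ($posts as $post) {
        $items .= sprintf(
            "<li><a href=\"%s\">%s</a></li>\n",
            htmlspecialchars($post['link'], ENT_QUOTES),
            htmlspecialchars($post['title'], ENT_QUOTES)
        );
    }
    return "<ul>\n" . $items . "</ul>\n";
}

$posts = extractPosts($html, 'Kerala', 'weather');
echo renderPosts($posts);
```

For a real crawl you would fetch each page (for example with loadHTMLFile as above) instead of using the inline string, and merge the arrays returned for each site before rendering. Escaping with htmlspecialchars matters here because the scraped text ends up inside your own page.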

I hope that gets you on the right track. For a more detailed example, you would have to provide more information.

2 answers

Answer 1
0 Accepted 2010-12-08 08:40:59

Answer 2
0 2010-12-08 08:41:21
