
What PHP web crawler libraries are available?

I'm looking for some robust, well-documented PHP web crawler scripts. Perhaps a PHP port of the Java project - http://wiki.apache.org/nutch/NutchTutorial

I'm looking for both free and non-free versions.

https://github.com/fabpot/Goutte is also a good library that complies with the PSR-0 standard.
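
For reference, a minimal Goutte sketch (the URL is a placeholder; request() returns a Symfony DomCrawler instance you can filter with CSS selectors):

use Goutte\Client;

// create the client (a thin wrapper around Symfony BrowserKit + DomCrawler)
$client = new Client();

// fetch a page; request() returns a Symfony DomCrawler instance
$crawler = $client->request('GET', 'http://www.example.com/');

// use CSS selectors to pull data out of the page
$crawler->filter('a')->each(function ($node) {
    echo $node->text() . "\n";
});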

Just give Snoopy a try.

Excerpt: "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."
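
A minimal sketch of the two tasks the excerpt mentions, fetching a page and posting a form (the URLs and form fields below are placeholders):

include 'Snoopy.class.php'; // adjust the path to wherever Snoopy is installed

$snoopy = new Snoopy();

// fetch a page; the response body lands in $snoopy->results
if ($snoopy->fetch('http://www.example.com/')) {
    echo $snoopy->results;
}

// post a form by submitting an array of form variables
$formVars = array('q' => 'php crawler'); // placeholder field name/value
if ($snoopy->submit('http://www.example.com/search.php', $formVars)) {
    echo $snoopy->results;
}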

There is a great tutorial here which combines guzzlehttp and symfony/dom-crawler.

In case the link is lost, here is the code you can make use of.

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
use RuntimeException;

// create an HTTP client instance (Guzzle 6+ style, with a base URI)
$client = new Client(['base_uri' => 'http://download.cloud.com/releases']);

// request the page to crawl
$response = $client->request('GET', '/3.0.6/api_3.0.6/TOC_Domain_Admin.html');

// get the status code
$status = $response->getStatusCode();

// create a crawler instance from the response body (usually HTML)
$crawler = new Crawler((string) $response->getBody());

// apply a CSS selector filter
$filter = $crawler->filter('div.apismallbullet_box');
$result = array();

// make sure the selector matched something
if (iterator_count($filter) > 0) {

    // iterate over the filtered nodes
    foreach ($filter as $i => $content) {

        // create a crawler instance for this node
        $nodeCrawler = new Crawler($content);

        // extract the values needed
        $topic = $nodeCrawler->filter('h5')->text();
        $result[$i] = array(
            'topic' => $topic,
            'className' => trim(str_replace(' ', '', $topic)) . 'Client',
        );
    }
} else {
    throw new RuntimeException('Got empty result processing the dataset!');
}

You can use PHP Simple HTML DOM Parser. It's really simple and useful.
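
For example, a minimal sketch (the URL is a placeholder; simple_html_dom.php comes from the project download):

include 'simple_html_dom.php';

// load a page directly from a URL
$html = file_get_html('http://www.example.com/');

// find() takes CSS-like selectors; print every link's href
foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}

// grab the first <h1> and read its text
$h1 = $html->find('h1', 0);
if ($h1) {
    echo $h1->plaintext . "\n";
}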

I had been using Simple HTML DOM for about 3 years before I discovered phpQuery. It's a lot faster, doesn't work recursively (you can actually dump it), and has full support for jQuery selectors and methods.
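
A minimal phpQuery sketch showing the jQuery-style pq() API (the include path and URL are placeholders):

require 'phpQuery/phpQuery.php';

// load markup into phpQuery; newDocumentFile() can load from a file/URL instead
phpQuery::newDocument(file_get_contents('http://www.example.com/'));

// pq() behaves like jQuery's $() against the loaded document
foreach (pq('a') as $link) {
    // iteration yields DOMElement nodes, so wrap them in pq() again
    echo pq($link)->attr('href') . "\n";
}

// jQuery-style chaining also works
echo pq('h1')->text() . "\n";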

If you are thinking about a strong base component, then give http://symfony.com/doc/2.0/components/dom_crawler.html a try.

It is amazing, with features like CSS selectors.

I know it is a bit of an old question. A lot of useful libraries have come out since then.

Give Crawlzone a shot. It is a fast, well-documented, asynchronous internet crawling framework with a lot of great features (see the sketch after this list):

  • Asynchronous crawling with customizable concurrency.
  • Automatically throttles crawling speed based on the load of the website you are crawling.
  • If configured, automatically filters out requests forbidden by the robots.txt exclusion standard.
  • Straightforward middleware system allows you to append headers, extract data, filter, or plug in any custom functionality to process the request and response.
  • Rich filtering capabilities.
  • Ability to set crawling depth.
  • Easy to extend the core by hooking into the crawling process using events.
  • Shut down the crawler at any time and start over without losing progress.
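
A rough sketch of what getting started looks like, based on the project README; the exact config keys shown ('start_uri', 'concurrency', 'depth') are assumptions here, so check the Crawlzone documentation for the current schema:

use Crawlzone\Client;

// assumed config keys from the README; verify against the docs
$config = [
    'start_uri' => ['https://httpbin.org/'], // where the crawl begins
    'concurrency' => 3,                      // parallel requests
    'depth' => 2,                            // how deep to follow links
];

$client = new Client($config);

// run the crawl; middleware/event hooks (not shown) plug into the process
$client->run();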

Also check out the article I wrote about it:

https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm

Nobody mentioned wget as a good starting point?

wget -r --level=10 -nd http://www.mydomain.com/
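
Here -r enables recursive retrieval, --level=10 caps the recursion depth at 10, and -nd keeps wget from recreating the site's directory hierarchy locally.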

More @ http://www.erichynds.com/ubuntulinux/automatically-crawl-a-website-looking-for-errors/
