
What PHP web crawler libraries are available?

I'm looking for some robust, well-documented PHP web crawler scripts. Perhaps a PHP port of the Java project - http://wiki.apache.org/nutch/NutchTutorial

I'm looking for both free and non-free versions.

https://github.com/fabpot/Goutte is also a good library that complies with the PSR-0 standard.
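
For reference, a minimal Goutte sketch (the URL is a placeholder; request() returns a Symfony DomCrawler instance you can filter with CSS selectors):

use Goutte\Client;

// create the client (a thin wrapper around Symfony BrowserKit + DomCrawler)
$client = new Client();

// fetch a page; request() returns a Symfony DomCrawler instance
$crawler = $client->request('GET', 'http://www.example.com/');

// use CSS selectors to pull data out of the page
$crawler->filter('a')->each(function ($node) {
    echo $node->text() . "\n";
});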

Just give Snoopy a try.

Excerpt: "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."
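
A minimal sketch of the two tasks the excerpt mentions, fetching a page and posting a form (the URLs and form fields below are placeholders):

include 'Snoopy.class.php'; // adjust the path to wherever Snoopy is installed

$snoopy = new Snoopy();

// fetch a page; the response body lands in $snoopy->results
if ($snoopy->fetch('http://www.example.com/')) {
    echo $snoopy->results;
}

// post a form by submitting an array of form variables
$formVars = array('q' => 'php crawler'); // placeholder field name/value
if ($snoopy->submit('http://www.example.com/search.php', $formVars)) {
    echo $snoopy->results;
}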

There is a great tutorial here which combines guzzlehttp and symfony/dom-crawler.

In case the link is lost, here is the code you can make use of.

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
use RuntimeException;

// create an HTTP client instance (Guzzle 6+ style, with a base URI)
$client = new Client(['base_uri' => 'http://download.cloud.com/releases']);

// request the page to crawl
$response = $client->request('GET', '/3.0.6/api_3.0.6/TOC_Domain_Admin.html');

// get the status code
$status = $response->getStatusCode();

// create a crawler instance from the response body (usually HTML)
$crawler = new Crawler((string) $response->getBody());

// apply a CSS selector filter
$filter = $crawler->filter('div.apismallbullet_box');
$result = array();

// make sure the selector matched something
if (iterator_count($filter) > 0) {

    // iterate over the filtered nodes
    foreach ($filter as $i => $content) {

        // create a crawler instance for this node
        $nodeCrawler = new Crawler($content);

        // extract the values needed
        $topic = $nodeCrawler->filter('h5')->text();
        $result[$i] = array(
            'topic' => $topic,
            'className' => trim(str_replace(' ', '', $topic)) . 'Client',
        );
    }
} else {
    throw new RuntimeException('Got empty result processing the dataset!');
}

You can use PHP Simple HTML DOM Parser. It's really simple and useful.
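
For example, a minimal sketch (the URL is a placeholder; simple_html_dom.php comes from the project download):

include 'simple_html_dom.php';

// load a page directly from a URL
$html = file_get_html('http://www.example.com/');

// find() takes CSS-like selectors; print every link's href
foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}

// grab the first <h1> and read its text
$h1 = $html->find('h1', 0);
if ($h1) {
    echo $h1->plaintext . "\n";
}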

I had been using Simple HTML DOM for about 3 years before I discovered phpQuery. It's a lot faster, doesn't work recursively (you can actually dump it), and has full support for jQuery selectors and methods.
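
A minimal phpQuery sketch showing the jQuery-style pq() API (the include path and URL are placeholders):

require 'phpQuery/phpQuery.php';

// load markup into phpQuery; newDocumentFile() can load from a file/URL instead
phpQuery::newDocument(file_get_contents('http://www.example.com/'));

// pq() behaves like jQuery's $() against the loaded document
foreach (pq('a') as $link) {
    // iteration yields DOMElement nodes, so wrap them in pq() again
    echo pq($link)->attr('href') . "\n";
}

// jQuery-style chaining also works
echo pq('h1')->text() . "\n";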

If you are thinking about a strong base component, then give http://symfony.com/doc/2.0/components/dom_crawler.html a try.

It is amazing, with features like CSS selectors.

I know it is a bit of an old question. A lot of useful libraries have come out since then.

Give Crawlzone a shot. It is a fast, well-documented, asynchronous internet crawling framework with a lot of great features (see the sketch after this list):

  • Asynchronous crawling with customizable concurrency.
  • Automatically throttles crawling speed based on the load of the website you are crawling.
  • If configured, automatically filters out requests forbidden by the robots.txt exclusion standard.
  • Straightforward middleware system allows you to append headers, extract data, filter, or plug in any custom functionality to process the request and response.
  • Rich filtering capabilities.
  • Ability to set crawling depth.
  • Easy to extend the core by hooking into the crawling process using events.
  • Shut down the crawler at any time and start over without losing progress.
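
A rough sketch of what getting started looks like, based on the project README; the exact config keys shown ('start_uri', 'concurrency', 'depth') are assumptions here, so check the Crawlzone documentation for the current schema:

use Crawlzone\Client;

// assumed config keys from the README; verify against the docs
$config = [
    'start_uri' => ['https://httpbin.org/'], // where the crawl begins
    'concurrency' => 3,                      // parallel requests
    'depth' => 2,                            // how deep to follow links
];

$client = new Client($config);

// run the crawl; middleware/event hooks (not shown) plug into the process
$client->run();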

Also check out the article I wrote about it:

https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm

Nobody mentioned wget as a good starting point?

wget -r --level=10 -nd http://www.mydomain.com/
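
Here -r enables recursive retrieval, --level=10 caps the recursion depth at 10, and -nd keeps wget from recreating the site's directory hierarchy locally.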

More @ http://www.erichynds.com/ubuntulinux/automatically-crawl-a-website-looking-for-errors/
