
Web Crawler with JavaScript support in Perl?

I want to write a Perl application that crawls some websites and collects images and links from their pages. Because most of the pages use JavaScript to generate HTML content, I effectively need a client browser with JavaScript support so that I can parse the final HTML as generated and/or modified by JavaScript. What are my options?

If possible, please post some implementation code or link to some example(s).

Options that spring to mind:

  • You could have Perl use Selenium and have a full-blown browser do the work for you.

  • You can download and compile V8 or another open source JavaScript engine and have Perl call an external program to evaluate the JavaScript.

  • I don't think Perl's LWP module supports JavaScript, but you might want to check that if you haven't done so already.

WWW::Scripter with the WWW::Scripter::Plugin::JavaScript and WWW::Scripter::Plugin::Ajax plugins seems to be the closest you will get without using an actual browser (the modules WWW::Selenium, Mozilla::Mechanize, and Win32::IE::Mechanize use real browsers).

Check the complete working example featured in "Scraping pages full of JavaScript". It uses Web::Scraper for HTML processing and Gtk3::WebKit to process dynamic content. However, the latter is quite a PITA to install. If you don't have that many pages to scrape (< 1000), fetching the post-processed DOM content through PhantomJS is an interesting option. I've written the following script for that purpose:

var page = require('webpage').create(),
    system = require('system'),
    fs = require('fs'),
    address, output;

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: phantomjs --load-images=no html.js URL filename');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
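    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            // Exit non-zero so a calling script can detect the failure.
            phantom.exit(1);
        } else {
            // Save the page's HTML as it stands after loading.
            fs.write(output, page.content, 'w');
            phantom.exit();
        }
    });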
}

There's something like that on CPAN already, a module called Wight, but I haven't tested it yet.
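As a variation on the PhantomJS script above: if only the links and images are needed, it may be simpler to collect them inside the rendered page rather than saving the whole HTML and parsing it afterwards. The following is a minimal sketch of that idea, not something I have run against a real site. The helper name collectUrls and the JSON output format are my own assumptions; page.evaluate is the PhantomJS call that runs a function in the page context and returns its JSON-serializable result.

```javascript
// Sketch only: collectUrls is a hypothetical helper, not part of the original script.
// It is meant to run inside the page (via page.evaluate), so it relies only on
// the page's own global `document`.
function collectUrls() {
    // Return the given property of every element matching the selector.
    function values(selector, property) {
        var nodes = document.querySelectorAll(selector),
            out = [],
            i;
        for (i = 0; i < nodes.length; i++) {
            out.push(nodes[i][property]);
        }
        return out;
    }
    return {
        links: values('a[href]', 'href'),   // link targets as exposed by the DOM
        images: values('img[src]', 'src')   // image sources as exposed by the DOM
    };
}

// Assumed usage in the success branch of page.open() in the script above:
//     var urls = page.evaluate(collectUrls);
//     fs.write(output, JSON.stringify(urls, null, 2), 'w');
```

The Perl side would then read the output file with a JSON module instead of parsing HTML. Keep in mind that content added by scripts after the load callback fires may still be missing.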

WWW::Mechanize::Firefox can be used with mozrepl, with all the JavaScript action.
