Web Crawler with JavaScript support in Perl?
I want to write a Perl application that crawls some websites and collects the images and links from their pages. Because most pages use JavaScript to generate HTML content, I essentially need a client browser with JavaScript support so that I can parse the final HTML as generated and/or modified by JavaScript. What are my options?
If possible, please post some implementation code or link to some examples.
There are several options.
Options that spring to mind:
You could have Perl drive Selenium and let a full-blown browser do the work for you.
You can download and compile V8 or another open source JavaScript engine and have Perl call an external program to evaluate the JavaScript.
I don't think Perl's LWP module supports JavaScript, but you might want to check that if you haven't done so already.
WWW::Scripter with the WWW::Scripter::Plugin::JavaScript and WWW::Scripter::Plugin::Ajax plugins seems to be the closest you will get without using an actual browser (the modules WWW::Selenium, Mozilla::Mechanize and Win32::IE::Mechanize drive real browsers).
Check the complete working example featured in "Scraping pages full of JavaScript". It uses Web::Scraper for HTML processing and Gtk3::WebKit to process dynamic content. However, the latter is quite a PITA to install.
If there are not that many pages you need to scrape (< 1000), fetching the post-processed DOM content through PhantomJS is an interesting option. I've written the following script for that purpose:
    // html.js: save the JavaScript-rendered HTML of a page to a file.
    var page = require('webpage').create(),
        system = require('system'),
        fs = require('fs'),
        address, output;

    // system.args[0] is the script name, so URL plus filename makes three entries.
    if (system.args.length < 3 || system.args.length > 5) {
        console.log('Usage: phantomjs --load-images=no html.js URL filename');
        phantom.exit(1);
    } else {
        address = system.args[1];
        output = system.args[2];
        page.open(address, function (status) {
            if (status !== 'success') {
                console.log('Unable to load the address!');
            } else {
                // page.content is the page's HTML as it stands when this callback fires.
                fs.write(output, page.content, 'w');
            }
            phantom.exit();
        });
    }
Something similar already exists on CPAN, a module called Wight, but I haven't tested it yet.
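Since the goal in the question is to collect images and links, a possible variation on the PhantomJS script above is to gather those values inside the page rather than dumping the whole HTML. The following is only a sketch under stated assumptions: the function name `collectLinksAndImages` and its optional `doc` parameter are illustrative and not part of the original script. It is written in ES5 so that it should suit PhantomJS, and it returns the raw attribute values, which may be relative URLs.

```javascript
// Hypothetical helper (not from the original script): collect the href
// and src attribute values of all links and images in a DOM document.
// When called without an argument it falls back to the global `document`,
// which is what exists when the function runs inside a loaded page.
function collectLinksAndImages(doc) {
    doc = doc || document;

    // Return the non-empty values of attribute `name` for every
    // element matching `selector`.
    function attributeValues(selector, name) {
        var nodes = doc.querySelectorAll(selector),
            values = [],
            i, value;
        for (i = 0; i < nodes.length; i++) {
            value = nodes[i].getAttribute(name);
            if (value) {
                values.push(value);
            }
        }
        return values;
    }

    return {
        links: attributeValues('a[href]', 'href'),
        images: attributeValues('img[src]', 'src')
    };
}
```

Inside the `page.open` callback, `page.evaluate(collectLinksAndImages)` would presumably run the function in the page and hand back a plain object, which could then be written out with `JSON.stringify` for the Perl side to read. Relative URLs would still need to be resolved against the page address.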
WWW::Mechanize::Firefox can be used together with mozrepl, which gives you full JavaScript handling.