Scraping a fully rendered webpage with Node.js
I am trying to get Amazon pricing information with Node.js.
Here's the target URL: http://aws.amazon.com/ec2/pricing/
But the content of the pricing tables that I read in Node.js is not fully rendered; the response contains only JavaScript.
So far I have tried jsdom, jquerygo and phantom, without success. Even setting timeouts does not help.
Can anyone please provide me with a working solution for this specific case?
Thanks and best regards.
There are different ways to scrape a web page using Node.js.
I was inspired by SpookyJS:
var Spooky = require('spooky');

var spooky = new Spooky({
    child: {
        transport: 'http'
    },
    casper: {
        logLevel: 'debug',
        verbose: true
    }
}, function (err) {
    if (err) {
        // Declare the error locally instead of leaking an implicit global.
        var e = new Error('Failed to initialize SpookyJS');
        e.details = err;
        throw e;
    }

    spooky.start(
        'http://en.wikipedia.org/wiki/Spooky_the_Tuff_Little_Ghost');
    spooky.then(function () {
        // Runs in the CasperJS context; evaluate() runs inside the page.
        this.emit('hello', 'Hello, from ' + this.evaluate(function () {
            return document.title;
        }));
    });
    spooky.run();
});

spooky.on('error', function (e, stack) {
    console.error(e);
    if (stack) {
        console.log(stack);
    }
});

// Uncomment to see everything the page writes to its console:
// spooky.on('console', function (line) {
//     console.log(line);
// });

// Receives the value emitted from the CasperJS step above.
spooky.on('hello', function (greeting) {
    console.log(greeting);
});

spooky.on('log', function (log) {
    if (log.space === 'remote') {
        console.log(log.message.replace(/ \- .*/, ''));
    }
});
This solved my issue:
I noticed that when I installed the phantom module for Node, it complained about the installed PhantomJS version (version 2) and downloaded version 1.9.8 to a temporary location.
So I installed version 1.9.8 instead and pointed the PATH variable to it. And it worked!
Also note that inside the page.open(...) function you must call setTimeout with quite a long delay (about 35 seconds in my case) so that the whole page is fully loaded and rendered.