
Scraping a fully rendered webpage with Node.js

I am trying to get Amazon pricing information with Node.js.

Here's the target URL: http://aws.amazon.com/ec2/pricing/

But the content of the pricing tables that I read in Node.js is not fully rendered; the response contains only JavaScript.

So far I have tried jsdom, jquerygo and phantom, without success. Even setting timeouts does not help. Can anyone provide a working solution for this specific case?

Thanks and best regards.

There are different ways to scrape a web page using Node.js.

I was inspired by SpookyJS:

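var Spooky = require('spooky');

// Spawn a CasperJS child process and talk to it over HTTP.
var spooky = new Spooky({
    child: {
        transport: 'http'
    },
    casper: {
        logLevel: 'debug',
        verbose: true
    }
}, function (err) {
    if (err) {
        // Declare e locally instead of leaking an implicit global.
        var e = new Error('Failed to initialize SpookyJS');
        e.details = err;
        throw e;
    }

    spooky.start(
        'http://en.wikipedia.org/wiki/Spooky_the_Tuff_Little_Ghost');
    spooky.then(function () {
        // This function runs inside CasperJS, not in the Node.js process.
        this.emit('hello', 'Hello, from ' + this.evaluate(function () {
            return document.title;
        }));
    });
    spooky.run();
});

// Report errors raised by the child process.
spooky.on('error', function (e, stack) {
    console.error(e);

    if (stack) {
        console.log(stack);
    }
});

// Forward console output from the child process.
spooky.on('console', function (line) {
    console.log(line);
});

// Receive the custom 'hello' event emitted in the step above.
spooky.on('hello', function (greeting) {
    console.log(greeting);
});

// Print only log messages that originate from the remote page.
spooky.on('log', function (log) {
    if (log.space === 'remote') {
        console.log(log.message.replace(/ \- .*/, ''));
    }
});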

Note: SpookyJS gives you the flexibility to drive CasperJS and PhantomJS from Node.js.

This solved my issue:

I noticed that when installing the phantom module in Node, it complained about the installed PhantomJS version (version 2) and downloaded version 1.9.8 to a temporary location.

So I installed version 1.9.8 instead and pointed the PATH variable at it, and it worked. Also note that inside the page.open(...) callback you must call setTimeout with quite a long delay (about 35 seconds in my case) so that the whole page is fully loaded and rendered.
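The delay described above might look roughly like the following. This is only a sketch: it assumes the callback-style API of older releases of the `phantom` npm module (`phantom.create`, `ph.createPage`, `page.open`, `page.evaluate`, `ph.exit`), and newer releases may expose a different, promise-based interface. The helper name `scrapeRendered` and its parameters are hypothetical, not part of any library; the module is passed in as an argument so the helper can also be exercised with a stub.

```javascript
// Hypothetical helper: open a URL in PhantomJS, wait a fixed delay so the
// page's own scripts can finish rendering, then return the resulting HTML.
// Assumes the callback-style API of older `phantom` npm releases.
function scrapeRendered(phantom, url, delayMs, done) {
  phantom.create(function (ph) {
    ph.createPage(function (page) {
      page.open(url, function (status) {
        if (status !== 'success') {
          ph.exit();
          return done(new Error('Failed to open ' + url + ': ' + status));
        }
        // Give the page time to fill in its dynamically rendered content.
        setTimeout(function () {
          page.evaluate(function () {
            // Runs inside the page, so only browser globals are available.
            return document.documentElement.outerHTML;
          }, function (html) {
            ph.exit();
            done(null, html);
          });
        }, delayMs);
      });
    });
  });
}

// Possible usage (needs `npm install phantom` and a PhantomJS binary on PATH):
// scrapeRendered(require('phantom'), 'http://aws.amazon.com/ec2/pricing/',
//   35000, function (err, html) {
//     if (err) { return console.error(err); }
//     console.log(html.length + ' characters of rendered HTML');
//   });
```

A fixed 35-second wait is the simplest option and matches what worked here; the right value depends on the page and the connection, so treat it as a starting point.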
