
How to use PhantomJS along with node.js for scraping?

I have installed node-phantom with npm install node-phantom, but when I run this code it fails with the error Cannot find module 'webpage':

var webpage = require('webpage').create(),
    url = "https://www.example.com/cba/abc",
    hrefs = new Array();
webpage.open(url,function(status){
    if(status=="success"){
        var results = page.evaluate(function(){
            $("#endpoints").each(function() {
                  hrefs.push($(this).attr("href"));
            });
            return hrefs;
        });
        console.log(JSON.stringify(results));
        phantom.exit();
    }
});

You don't require the webpage module in node-phantom. Instead, you use node-phantom's API to get an object that represents the webpage module. It has to be done this way because PhantomJS has a different execution runtime from node.js, so the two generally can't share modules. That is why bridges between the two execution environments exist, such as node-phantom and phantom. They essentially replicate the PhantomJS API for use in node.js.

As per the documentation, you don't require the webpage module; you get a page object instead:

var phantom = require('node-phantom');
phantom.create(function(err,ph) {
  return ph.createPage(function(err,page) {
    // do something with page: basically your script
  });
});

You won't be able to just copy and paste existing PhantomJS code. There are differences, so you will have to study the API (basically the README on GitHub).

Complete translation of your code:

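var phantom = require('node-phantom');
var url = "https://www.example.com/cba/abc";
phantom.create(function(err,ph) {
  return ph.createPage(function(err,page) {
    // node-phantom callbacks are error-first, so status is the second argument
    page.open(url,function(err,status){
      if(status=="success"){
        page.evaluate(function(){
          // this function runs inside the page, so hrefs must be declared here
          var hrefs = [];
          $("#endpoints").each(function() {
            hrefs.push($(this).attr("href"));
          });
          return hrefs;
        }, function(err, results){
          console.log(JSON.stringify(results));
          ph.exit();
        });
      } else {
        // close PhantomJS even when the page fails to load
        ph.exit();
      }
    });
  });
});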
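The function passed to page.evaluate above relies on the target page already providing jQuery as $. If that is not the case, one option is to collect the attributes with standard DOM calls instead. The following is only a sketch: collectHrefs and stubDocument are hypothetical names introduced for illustration, and the stub stands in for the real document so the snippet can be run under plain node.js without PhantomJS.

```javascript
// Hypothetical helper (not part of node-phantom or PhantomJS):
// collects the href attribute of every element matching the selector,
// using only standard DOM methods so that jQuery is not required.
function collectHrefs(doc, selector) {
  var nodes = doc.querySelectorAll(selector);
  var hrefs = [];
  for (var i = 0; i < nodes.length; i++) {
    hrefs.push(nodes[i].getAttribute('href'));
  }
  return hrefs;
}

// Stand-in for the browser's document, for illustration only.
var stubDocument = {
  querySelectorAll: function (selector) {
    if (selector !== '#endpoints') {
      return [];
    }
    return [{
      getAttribute: function (name) {
        return name === 'href' ? '/abc' : null;
      }
    }];
  }
};

console.log(collectHrefs(stubDocument, '#endpoints'));
```

Inside a real page, the loop body would have to be written directly in the function handed to page.evaluate, with document in place of doc, because that function cannot call helpers defined in the node.js script.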

page.evaluate is still sandboxed, so you can't use variables from the outside, such as hrefs.
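One way to picture the sandbox is that only the source text of the function reaches the page, not the scope it was defined in. The snippet below is a simulation in plain node.js, not node-phantom or PhantomJS code: runInSimulatedSandbox and demo are hypothetical names, and the Function constructor is used here only as a stand-in for "rebuild the function somewhere that cannot see the caller's local variables".

```javascript
// Simulation only: mimics a function being shipped to another runtime
// as source text. This is an assumption-based illustration, not the
// actual mechanism used by any particular bridge.
function runInSimulatedSandbox(fn, args) {
  var source = fn.toString();
  // Arguments are copied through JSON, as data crossing a process
  // boundary would be.
  var copiedArgs = JSON.parse(JSON.stringify(args || []));
  // Functions built with the Function constructor only see the global
  // scope, so the caller's local variables are out of reach.
  var rebuilt = new Function(
    'return (' + source + ').apply(null, arguments);'
  );
  return rebuilt.apply(null, copiedArgs);
}

function demo() {
  var selector = '#endpoints'; // local variable, outside the "sandbox"

  var closureResult;
  try {
    // Tries to read the outer variable from inside the shipped function.
    closureResult = runInSimulatedSandbox(function () {
      return selector;
    });
  } catch (e) {
    closureResult = e.name;
  }

  // Passing the value in explicitly works, because it is copied across.
  var argumentResult = runInSimulatedSandbox(function (sel) {
    return sel;
  }, [selector]);

  return { closureResult: closureResult, argumentResult: argumentResult };
}

console.log(demo());
```

In this simulation the closure attempt fails with a ReferenceError while the explicitly passed value comes through, which is why the translated code declares hrefs inside the evaluated function rather than outside it.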


 