Save and render a webpage with PhantomJS and node.js

I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be a simple example with an obvious use-case for PhantomJS. I can't find a decent example; the documentation seems to be all about command line use.

From your comments, I'd guess you have 2 options:

  1. Use a phantomjs node module - https://github.com/amir20/phantomjs-node
  2. Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html

Edit:

It seems the child process is suggested by phantomjs as a way of interacting with node, see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ

Edit:

Example PhantomJS script for getting the page's HTML markup:

var page = require('webpage').create();  
page.open('http://www.google.com', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML
        });
        console.log(p);
    }
    phantom.exit();
});
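
The script above reads the markup as soon as the load callback fires, so changes that the page's own scripts make to the DOM later may be missed. One common workaround is to poll for a condition before grabbing the HTML. Below is a minimal sketch of such a polling helper; the name waitFor, its parameters, and the '#result' selector in the usage comment are assumptions for illustration, not part of the PhantomJS API.

```javascript
// Poll `condition` every `intervalMs` until it returns a truthy value
// or `timeoutMs` has elapsed. `done` is called exactly once:
// with null on success, or with an Error on timeout or if `condition` throws.
function waitFor(condition, done, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    var ready;
    try {
      ready = condition();
    } catch (e) {
      clearInterval(timer);
      done(e);
      return;
    }
    if (ready) {
      clearInterval(timer);
      done(null);
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      done(new Error('Timed out after ' + timeoutMs + ' ms'));
    }
  }, intervalMs);
}

// Hypothetical use inside the page.open callback of a PhantomJS script,
// in place of the immediate phantom.exit() call. '#result' is an assumed
// selector for an element that the page's scripts add once they finish:
//
//   waitFor(function () {
//     return page.evaluate(function () {
//       return document.querySelector('#result') !== null;
//     });
//   }, function (err) {
//     console.log(err ? err.message : page.content);
//     phantom.exit(err ? 1 : 0);
//   }, 5000, 100);
```

The helper uses only timers and plain ES5, so it can be tried out in node first and then pasted into a PhantomJS script.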

With v2 of phantomjs-node it's pretty easy to print the HTML after it has been processed.

var phantom = require('phantom');

phantom.create().then(function(ph) {
  ph.createPage().then(function(page) {
    page.open('https://stackoverflow.com/').then(function(status) {
      console.log(status);
      page.property('content').then(function(content) {
        console.log(content);
        page.close();
        ph.exit();
      });
    });
  });
});

This will show the output as it would have been rendered by the browser.

Edit 2019:

You can use async/await:

const phantom = require('phantom');

(async function() {
  const instance = await phantom.create();
  const page = await instance.createPage();
  await page.on('onResourceRequested', function(requestData) {
    console.info('Requesting', requestData.url);
  });

  const status = await page.open('https://stackoverflow.com/');
  const content = await page.property('content');
  console.log(content);

  await instance.exit();
})();
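
The snippet above never checks status, and if any step throws, instance.exit() is not reached, which may leave the PhantomJS process running. A minimal sketch of a wrapper that handles both cases follows. The function name fetchRenderedHtml and passing the library in as a parameter are assumptions for illustration; only the calls already shown above (create, createPage, open, property, exit) are used.

```javascript
// Load `url` and resolve with the page's HTML after PhantomJS has processed it.
// `phantomLib` is the object returned by require('phantom'); it is passed in
// so the function can also be exercised with a stand-in object.
async function fetchRenderedHtml(phantomLib, url) {
  const instance = await phantomLib.create();
  try {
    const page = await instance.createPage();
    const status = await page.open(url);
    if (status !== 'success') {
      throw new Error('Failed to load ' + url + ' (status: ' + status + ')');
    }
    return await page.property('content');
  } finally {
    // Runs on success and on failure, so the PhantomJS process is shut down.
    await instance.exit();
  }
}

// Usage (requires the phantom package to be installed):
//
//   fetchRenderedHtml(require('phantom'), 'https://stackoverflow.com/')
//     .then(console.log, console.error);
```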

Or if you just want to test, you can use npx:

npx phantom@latest https://stackoverflow.com/

I've used two different ways in the past, including the page.evaluate() method that queries the DOM, which Declan mentioned. The other way I've passed info from the web page is to write it out with console.log() from there, and in the phantomjs script use:

page.onConsoleMessage = function (msg, line, source) {
  console.log('console [' +source +':' +line +']> ' +msg);
}

I might also trap the variable msg in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.
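
As a rough illustration of searching msg for encapsulated data, the page could prefix a JSON payload with a marker string, and the handler could pick out only those messages. The marker '__PAGE_DATA__' and the function name extractPayload are assumed conventions for this sketch, not something PhantomJS defines.

```javascript
// Assumed convention: the page logs MARKER followed by a JSON string.
var MARKER = '__PAGE_DATA__';

// Return the parsed payload if `msg` starts with MARKER and the rest is
// valid JSON; return null for ordinary console messages or broken payloads.
function extractPayload(msg) {
  if (typeof msg !== 'string' || msg.indexOf(MARKER) !== 0) {
    return null;
  }
  try {
    return JSON.parse(msg.slice(MARKER.length));
  } catch (e) {
    return null;
  }
}

// Hypothetical use:
//   in the web page:
//     console.log('__PAGE_DATA__' + JSON.stringify({ title: document.title }));
//   in the phantomjs script:
//     page.onConsoleMessage = function (msg) {
//       var data = extractPayload(msg);
//       if (data !== null) { /* use data.title */ }
//     };
```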

Then in the Nodejs script, you would have to scan the output of the Phantomjs script:

var spawn = require('child_process').spawn;

// args: array of arguments for phantomjs (the script path first)
// callback: called as callback(err, output) once the process has finished
var yourfunc = function(args, callback) {
  var phantom = spawn('phantomjs', args);
  var str_phantom_output = '';
  phantom.stdout.setEncoding('utf8');
  phantom.stdout.on('data', function(data) {
    // This gets triggered one or more times, so collect the chunks here and
    // parse them later for whatever info you're expecting from the browser
    str_phantom_output += data;
  });
  phantom.stderr.on('data', function(data) {
    // do something with error data
  });
  // 'close' fires after the process has ended and its output streams are closed
  phantom.on('close', function(code) {
    if (code !== 0) {
      callback(new Error('phantomjs exited with code ' + code));
    } else {
      // clean exit: hand the collected output to the passed-in callback
      callback(null, str_phantom_output);
    }
  });
};
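
Because the 'data' handler above is triggered one or more times, a single line of PhantomJS output can arrive split across two chunks. If the output is parsed line by line, a small buffer helps. This is a minimal sketch; createLineSplitter and its push/flush methods are hypothetical names, not part of node's API.

```javascript
// Collect chunks and call `onLine` once per complete line.
function createLineSplitter(onLine) {
  var pending = '';
  return {
    // Feed one chunk of output; emits every line completed by this chunk.
    push: function (chunk) {
      pending += chunk;
      var lines = pending.split('\n');
      pending = lines.pop(); // the last piece may be an incomplete line
      lines.forEach(function (line) {
        onLine(line.replace(/\r$/, '')); // tolerate Windows line endings
      });
    },
    // Emit whatever is left when the stream ends without a final newline.
    flush: function () {
      if (pending !== '') {
        onLine(pending);
        pending = '';
      }
    }
  };
}

// Hypothetical use with the spawned process:
//   var splitter = createLineSplitter(function (line) { /* parse line */ });
//   phantom.stdout.on('data', splitter.push);
//   phantom.on('close', splitter.flush);
```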

Hope that helps some.

Why not just use this?

var page = require('webpage').create();
page.open("http://example.com", function (status)
{
    if (status !== 'success') 
    {
        console.log('FAIL to load the address');            
    } 
    else 
    {
        console.log('Success in fetching the page');
        console.log(page.content);
    }
    phantom.exit();
});

Late update in case anyone stumbles on this question:

A project on GitHub developed by a colleague of mine aims at helping you do exactly that: https://github.com/vmeurisse/phantomCrawl.

It's still a bit young and is certainly missing some documentation, but the example provided should help with basic crawling.

Here's an old version that I use, running node, express and phantomjs, which saves out the page as a .png. You could tweak it fairly quickly to get the HTML.

https://github.com/wehrhaus/sitescrape.git
