
Getting a webpage's HTML using PhantomJS

I am trying to use PhantomJS to load a page (one that uses JavaScript to load items on the page) and return all of the page's HTML (at least what is inside the <body /> tags) to the PHP function that executes phantomjs httpget.js.

Problem: I can get PhantomJS to return document.title, but asking it to console.log(document.body) simply gives me [object Object]. How can I extract the page's HTML?

It also takes much longer to load the webpage using PhantomJS than it does in the browser.

httpget.js

console.log('hello!');
var page = require('webpage').create();
page.open("http://www.asos.com/Men/T-Shirts-Vests/Cat/pgecategory.aspx?cid=7616#parentID=-1&pge=0&pgeSize=900&sort=1",
    function(status){
        console.log('Page title is ' + page.evaluate(function () {
            return document.body;
        }));
        phantom.exit();
    });

Output (running from shell)

hello!
Page title is [object Object]

document.body.innerHTML contains the HTML of the body.
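As a rough sketch of how that could slot into the asker's script: page.evaluate can only hand back values that survive JSON-style serialization, so returning the DOM node document.body does not work, while the string document.body.innerHTML does. The helper names getBodyHtml and fetchBodyHtml, the example.com URL, and the typeof phantom guard are illustrative assumptions, not part of the original script.

```javascript
// Runs inside the page context. It returns a plain string, which
// page.evaluate can pass back to the PhantomJS script (a DOM node cannot be).
function getBodyHtml() {
    return document.body.innerHTML;
}

// Hypothetical helper: opens `url` and hands the body HTML to `done(err, html)`.
function fetchBodyHtml(page, url, done) {
    page.open(url, function (status) {
        if (status !== 'success') {
            done(new Error('Failed to load ' + url), null);
            return;
        }
        done(null, page.evaluate(getBodyHtml));
    });
}

// Only runs under PhantomJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Placeholder URL; substitute the page you actually want to load.
    fetchBodyHtml(page, 'http://example.com/', function (err, html) {
        console.log(err ? err.message : html);
        phantom.exit(err ? 1 : 0);
    });
}
```

Checking the status argument before calling page.evaluate avoids printing the markup of a page that failed to load.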

Not sure what this has to do with Node.js as you appear to be using PhantomJS directly, not node (or phantom via node-phantom)...

But to answer your question, you need to do this:

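var html = page.evaluate(function () {
    // Prefer the whole document, starting from the <html> element.
    var root = document.getElementsByTagName("html")[0];
    // Fall back to the body's markup if no <html> element is found.
    var html = root ? root.outerHTML : document.body.innerHTML;
    return html;
});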

This works with pages that don't have an outer <html> tag.

Read the documentation: page.content gives you the entire HTML.
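A minimal sketch of that approach, assuming the same kind of script as in the question. page.content is a property of the page object, so no page.evaluate call is needed. The helper name fetchPageContent, the example.com URL, and the typeof phantom guard are illustrative assumptions.

```javascript
// Hypothetical helper: opens `url` and passes the full document markup
// to `done(err, html)`.
function fetchPageContent(page, url, done) {
    page.open(url, function (status) {
        if (status !== 'success') {
            done(new Error('Failed to load ' + url), null);
            return;
        }
        // page.content is read as a property, not called as a function.
        done(null, page.content);
    });
}

// Only runs under PhantomJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Placeholder URL; substitute the page you actually want to load.
    fetchPageContent(page, 'http://example.com/', function (err, html) {
        console.log(err ? err.message : html);
        phantom.exit(err ? 1 : 0);
    });
}
```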
