Webpage's HTML using Phantom
I am trying to use PhantomJS to load a page (one that uses JavaScript to load items on the webpage) and return all the HTML on the page (at least within the <body /> tags) to the PHP function that executes phantomjs httpget.js.
Problem: I can get phantomjs to return the document.title, but asking it to console.log(document.body) simply gives me [object Object]. How can I extract the page's HTML?
It also takes much longer to load the webpage using phantomjs compared to the browser.
httpget.js
console.log('hello!');
var page = require('webpage').create();
page.open("http://www.asos.com/Men/T-Shirts-Vests/Cat/pgecategory.aspx?cid=7616#parentID=-1&pge=0&pgeSize=900&sort=1",
    function (status) {
        console.log('Page title is ' + page.evaluate(function () {
            return document.body;
        }));
        phantom.exit();
    });
Output (running from shell)
hello!
Page title is [object Object]
document.body.innerHTML contains the HTML of the body.
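To illustrate why the question's script prints [object Object], here is a minimal sketch that runs in any JavaScript engine. The fakeBody object is a hypothetical stand-in for whatever page.evaluate hands back when the callback returns document.body; it is not a real DOM node. The point is only that concatenating an object with a string uses its default string form, while a string property such as innerHTML survives intact.

```javascript
// Hypothetical stand-in for the value returned when the evaluate
// callback returns document.body; a plain object is an assumption
// made for illustration only.
var fakeBody = { innerHTML: '<p>Hello</p>' };

// Concatenating an object with a string coerces it via toString().
var asObject = 'Body is ' + fakeBody;

// Returning the innerHTML string instead keeps the markup.
var asString = 'Body is ' + fakeBody.innerHTML;

console.log(asObject);
console.log(asString);
```

In the PhantomJS script, the equivalent change is to return document.body.innerHTML from the function passed to page.evaluate.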
Not sure what this has to do with Node.js, as you appear to be using PhantomJS directly, not Node (or PhantomJS via node-phantom)...
But to answer your question, you need to do this:
var html = page.evaluate(function () {
    var root = document.getElementsByTagName("html")[0];
    var html = root ? root.outerHTML : document.body.innerHTML;
    return html;
});
This also works with pages that don't have an outer <html> tag, since it falls back to document.body.innerHTML.
Read the documentation: page.content gives you the entire HTML.
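A rough sketch of how page.content might be used in a complete script follows. The fetchHtml helper and the http://example.com/ URL are hypothetical names chosen for illustration; only page.open, page.content, require('webpage').create() and phantom.exit() come from PhantomJS itself. The guard on the phantom global is there only so the helper can be loaded outside PhantomJS without errors.

```javascript
// Hypothetical helper: opens `url` on a PhantomJS-style page object and
// passes the full document HTML (or null on failure) to `done`.
function fetchHtml(page, url, done) {
    page.open(url, function (status) {
        if (status !== 'success') {
            done(null);
            return;
        }
        // page.content holds the whole document as a string.
        done(page.content);
    });
}

// Only runs inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Placeholder URL, an assumption for this sketch.
    fetchHtml(page, 'http://example.com/', function (html) {
        console.log(html === null ? 'Failed to load page' : html);
        phantom.exit();
    });
}
```

Because the script prints the HTML to standard output, the PHP function that runs phantomjs can capture it from there. If the page fills in items with JavaScript after the load event, the HTML read at this point may not include them yet.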