
Get the raw page content with PhantomJS

Is it possible to get the raw HTML from a webpage using PhantomJS, before any JavaScript is executed?

The following script returns the HTML after all scripts are loaded and executed.

var webPage = require('webpage');
var page = webPage.create();

page.open('http://stackoverflow.com', function (status) {
    var content = page.content;
    console.log('Content: ' + content);
    phantom.exit();
});

Is there a way to also access the initial source of the page?

DOMContentLoaded is the earliest event that is triggered when the page is loading, but it seems it is already too late in your case, because JavaScript can be executed before DOMContentLoaded is triggered (think <script>doSomething();</script>).

The next idea would be to run setInterval(check, 5); where check tries to determine whether the initial HTML is fully loaded. This doesn't guarantee that no other JavaScript has already run, and it is impossible to detect whether the page is loaded, because page.content always includes </body></html>.

The obvious solution would be to disable JavaScript entirely with page.settings.javascriptEnabled = false; but if you do that, you won't be able to access the DOM anymore. The only way to access it would be through page.content or similar properties.
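As a rough sketch of the javascriptEnabled approach described above: the helper below turns off script execution before opening the page and hands page.content to a callback. The name fetchWithoutScripts is a hypothetical helper, not part of PhantomJS, and the code assumes settings are read when page.open() is called, so the flag is set first. Treat it as an illustration rather than a definitive implementation.

```javascript
// Sketch: load a page with in-page JavaScript disabled (PhantomJS, ES5 syntax).
// fetchWithoutScripts is a hypothetical helper name, not a PhantomJS API.
function fetchWithoutScripts(page, url, callback) {
    // Set the flag before page.open(), since settings are applied at open time.
    page.settings.javascriptEnabled = false;
    page.open(url, function (status) {
        // page.content is the serialized document, so it may still differ
        // slightly from the exact bytes the server sent.
        callback(status === 'success' ? page.content : null);
    });
}

// This part only runs inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    fetchWithoutScripts(page, 'http://stackoverflow.com', function (html) {
        console.log(html === null ? 'Failed to load page' : html);
        phantom.exit();
    });
}
```

Passing the page object in as a parameter keeps the helper independent of how the page was created, so the same function can be reused for several URLs in one script.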

If you need only the page source, don't use PhantomJS for that. There are many solutions for this, such as cURL.

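It could be done via page.plainText, though per the PhantomJS documentation this property holds the page content as plain text, without element tags:

var page = require('webpage').create();
page.onLoadFinished = function (status) {
    if (status === 'success') {
        console.log(page.plainText);
    }
    phantom.exit(); // end the script whether or not the load succeeded
};
page.open('http://stackoverflow.com'); // open() starts loading the URL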
