I am trying to use PhantomJS to get the HTML generated by a dynamic page. I assumed this would be easy, but after a few hours of trying I still have had no luck.
This is the page's source code, which is also what eventually gets saved in 1.html:
<!doctype html>
<html lang="cs" ng-app="appId">
<head ng-controller="MainCtrl">
(omitted some lines)
<script src="/js/conf/config.js?pars"></script>
<script src="/js/all.js?pars"></script>
</head>
<body>
<!--<![endif]-->
<div site-loader></div>
<div page-layout>
<div ng-view></div>
</div>
</body>
</html>
All of the site's content gets loaded inside the site-loader div, but I cannot get at it, even though I wait for a timeout before PhantomJS scrapes the HTML. Here is the code I am using:
var url = 'http:...';
var page = require('webpage').create();
var fs = require('fs');
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Fail');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            fs.write('1.html', page.content, 'w');
            phantom.exit();
        }, 2000); // Change timeout as required to allow sufficient time
    }
});
What am I doing wrong?
EDIT: I decided to try the PJscrapper framework and configured it to scrape all the contents of the div block. All I got was this:
["","\n\t\tif (window.DOT) {\n\t\t\tDOT.cfg({service: 'sreality', impress: false});\n\t\t}\n\t","","Loader.load()","",""]
It seems I am missing something: I always get the markup as it is before Loader.load() runs, and a timeout clearly does not solve it.
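This should do the trick. It reads the rendered DOM from inside the page context with page.evaluate() instead of writing page.content: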
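// url is the same (elided) address used in the question
var url = 'http:...';
var page = require('webpage').create();
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the url!');
        phantom.exit();
    } else {
        // Give the page's scripts time to run before reading the DOM
        window.setTimeout(function () {
            // Runs inside the page context and returns the rendered markup
            var results = page.evaluate(function () {
                return document.documentElement.innerHTML;
            });
            console.log(results);
            phantom.exit();
        }, 200); // Increase if the content has not rendered yet
    }
});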
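If a fixed delay turns out to be unreliable, one alternative is to poll until the content actually appears. The following is only a sketch under some assumptions: waitFor is a hypothetical helper written for this example (not part of the PhantomJS API), the '[site-loader]' selector and the "has child elements" check are guesses about how this page signals that it has finished loading, and the timeout values are arbitrary. The PhantomJS part is guarded so that the helper can also be loaded on its own.

```javascript
// Hypothetical helper (not a PhantomJS built-in): calls check() repeatedly
// until it returns true, then onReady(); calls onTimeout() if the total
// waiting time reaches opts.timeoutMs.
function waitFor(check, onReady, onTimeout, opts) {
    opts = opts || {};
    var timeoutMs = opts.timeoutMs || 10000;
    var intervalMs = opts.intervalMs || 250;
    // The scheduler can be swapped out, which makes the helper easy to test.
    var schedule = opts.schedule || function (fn, ms) { setTimeout(fn, ms); };
    var elapsed = 0;

    function poll() {
        if (check()) {
            onReady();
            return;
        }
        elapsed += intervalMs;
        if (elapsed >= timeoutMs) {
            onTimeout();
            return;
        }
        schedule(poll, intervalMs);
    }

    poll();
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http:...'; // elided in the question; left as-is

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the url!');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Assumption: the page is "ready" once the site-loader div
            // has at least one child element.
            return page.evaluate(function () {
                var el = document.querySelector('[site-loader]');
                return !!el && el.children.length > 0;
            });
        }, function () {
            console.log(page.content);
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for the content to render');
            phantom.exit(1);
        }, { timeoutMs: 10000, intervalMs: 250 });
    });
}
```

The readiness check is the part most likely to need changing: if the page fills a different element, or keeps placeholder children in the div while loading, the condition inside page.evaluate() has to be adapted to whatever actually indicates that rendering is done.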