
Web scraping using PhantomJS

Is there a way to execute all of the JavaScript on a webpage exactly as a browser would, without specifying which function to execute? Most of the examples I have seen specify which portion of the scraped page's JavaScript to run. I need to scrape the whole page, run all of its JavaScript just as a browser does, and get the final rendered markup, the same thing you see in Chrome's Inspect tool.

I am sure there must be some way, but the example code from PhantomJS did not seem to have any example addressing this.

You don't specify what gets executed from the page with PhantomJS. You open the page with PhantomJS and all JavaScript that is executed in Chrome or Firefox is also executed in PhantomJS. It is a full browser without a "head".
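As a minimal sketch of that idea: open the page, let its scripts run, and read `page.content`, which holds the serialized DOM at that moment. The helper name `dumpRenderedHtml`, the file name `rendered.js`, and the 1000 ms settle delay are assumptions made for illustration; a page that loads data asynchronously may need a longer delay or an explicit check for a specific element.

```javascript
// Usage (assumed file name): phantomjs rendered.js http://example.com
"use strict";

// Hypothetical helper: opens `url` in `page`, waits `delayMs` for late scripts,
// then passes the rendered HTML to `done(err, html)`.
function dumpRenderedHtml(page, url, delayMs, done) {
    page.open(url, function (status) {
        if (status !== "success") {
            done(new Error("Failed to load " + url));
            return;
        }
        // Give asynchronous scripts (AJAX calls, timers) a moment to finish.
        setTimeout(function () {
            // page.content is the current DOM, not the original page source.
            done(null, page.content);
        }, delayMs);
    });
}

// Only run when executed by PhantomJS, where the global `phantom` exists.
if (typeof phantom !== "undefined") {
    var system = require("system");
    var page = require("webpage").create();
    // 1000 ms is an arbitrary assumption; tune it for the site being scraped.
    dumpRenderedHtml(page, system.args[1], 1000, function (err, html) {
        console.log(err ? err.message : html);
        phantom.exit(err ? 1 : 0);
    });
}
```

No page function is named anywhere in the script: everything the page would run in a desktop browser on load also runs here.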

There are some differences though. Clicking a download link will not trigger a download. The rendering engine that PhantomJS 1.x is based on is nearly 4 years old, so some pages are simply rendered differently because PhantomJS 1.x might not support a feature they rely on. (PhantomJS 2 is on the way and is now in unofficial "alpha" status.)

So you need to script every interaction that a user would perform on the page, in JavaScript or CoffeeScript. You don't call page functions. You manipulate DOM elements to simulate a user interacting with the page in the browser. This has to be done in such a crude way because the PhantomJS API doesn't provide high-level, user-like functions. If you want those, look at CasperJS, which is built on top of PhantomJS/SlimerJS.
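A rough sketch of what such a "crude" interaction can look like in plain PhantomJS: a function is passed to `page.evaluate`, runs inside the page, and dispatches a synthetic click on a DOM element. The helper name `clickInPage` and the selector `#load-more` are hypothetical, and the 1000 ms wait afterwards is an arbitrary assumption.

```javascript
"use strict";

// Runs inside the page (via page.evaluate), so it may only use DOM APIs.
// Returns true if an element matched `selector` and received a click event.
function clickInPage(selector) {
    var el = document.querySelector(selector);
    if (!el) {
        return false;
    }
    // Build a synthetic mouse click and dispatch it on the element.
    var ev = document.createEvent("MouseEvent");
    ev.initMouseEvent("click", true, true, window, 1, 0, 0, 0, 0,
                      false, false, false, false, 0, null);
    el.dispatchEvent(ev);
    return true;
}

// Only run when executed by PhantomJS, where the global `phantom` exists.
if (typeof phantom !== "undefined") {
    var page = require("webpage").create();
    page.open(require("system").args[1], function (status) {
        if (status !== "success") {
            phantom.exit(1);
            return;
        }
        // "#load-more" is a hypothetical selector; use one from the real page.
        var clicked = page.evaluate(clickInPage, "#load-more");
        console.log("clicked: " + clicked);
        // Wait for whatever the click triggered, then print the resulting DOM.
        setTimeout(function () {
            console.log(page.content);
            phantom.exit(0);
        }, 1000);
    });
}
```

The click handler that the page itself registered decides what happens next; the script only simulates the user action.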

There you actually have functions like click, wait, fetchText, etc.
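For comparison, a short CasperJS sketch using those higher-level calls. It has to be run with the `casperjs` executable rather than `phantomjs`. The helper name `queueSteps` and the selectors `#load-more` and `#results` are hypothetical placeholders, and `waitForSelector` stands in here for the `wait` family of functions.

```javascript
// Usage (assumed file name): casperjs steps.js http://example.com
"use strict";

// Hypothetical helper: queues the navigation steps on a Casper instance.
function queueSteps(casper, url) {
    casper.start(url);
    casper.then(function () {
        // Click like a user would; the selector is a placeholder.
        this.click("#load-more");
    });
    // Continue only once the element that the click produces is present.
    casper.waitForSelector("#results", function () {
        this.echo(this.fetchText("#results"));
    });
    casper.run();
}

// Only run under CasperJS/PhantomJS, where the global `phantom` exists.
if (typeof phantom !== "undefined") {
    var casper = require("casper").create();
    queueSteps(casper, casper.cli.get(0)); // first positional argument is the URL
}
```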

This will work. Put it in a file named "scrape.js" and execute it with phantomjs, passing your URL as the first argument:

    // Usage: phantomjs scrape.js http://your.url.to.scrape.com
    "use strict";

    var sys = require("system"),
        page = require("webpage").create(),
        logResources = false,
        url = sys.args[1];

    //console.log('fetch from', url);

    // Print each argument on its own line, serialized as JSON.
    function printArgs() {
        var i, ilen;
        for (i = 0, ilen = arguments.length; i < ilen; ++i) {
            console.log("    arguments[" + i + "] = " + JSON.stringify(arguments[i]));
        }
        console.log("");
    }

    ////////////////////////////////////////////////////////////////////////////////

    // Once the page has loaded, log the rendered body from inside the page.
    page.onLoadFinished = function() {
        page.evaluate(function() {
            console.log(document.body.innerHTML);
        });
    };

    // window.console.log(msg);
    // Page console output arrives here; print it and exit after the first message.
    page.onConsoleMessage = function() {
        printArgs.apply(this, arguments);
        phantom.exit(0);
    };

    ////////////////////////////////////////////////////////////////////////////////

    page.open(url);

