Scrape page after onload JS DOM injection

Question

I'm building a scraper that picks out the main images on a page (currently judged by Content-Length). It goes through all <img> elements and makes a HEAD request for each one. But certain pages, especially mobile ones, insert images after the page has loaded. Any ideas on how to tackle this?

I'm using Node.js.

Answer 1

I can't be sure that it solves your problem, but you could look into using jsdom, as it can fetch and execute the scripts in a page and gives you a DOM on the server side. Something like:

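var request = require('request'),
    jsdom = require('jsdom').jsdom;

request(url, function(err, response, body) {
  if (err) return console.error(err);

  // Resource options belong under `features`; scripts must be both
  // fetched and processed for injected <img> elements to appear.
  var doc = jsdom(body, null, {
    features: {
      FetchExternalResources: ['script', 'img'],
      ProcessExternalResources: ['script']
    }
  });
  var window = doc.createWindow();

  // Read the DOM only after load, once the page's scripts have run.
  window.addEventListener('load', function() {
    var images = doc.getElementsByTagName('img');
  });
});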
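
Once the `images` list is available, the Content-Length comparison described in the question could be sketched roughly as below. This uses only Node's built-in `http` and `https` modules; `headContentLength`, `pickLargest`, and `findMainImage` are hypothetical helper names, and treating a missing Content-Length as 0 is an assumption rather than anything the question specifies.

```javascript
var http = require('http');
var https = require('https');

// Issue a HEAD request and report the Content-Length header (0 if absent).
function headContentLength(imageUrl, callback) {
  var client = imageUrl.indexOf('https:') === 0 ? https : http;
  var done = false;
  function finish(err, size) {
    if (done) return; // make sure the callback fires only once
    done = true;
    callback(err, size);
  }
  var req = client.request(imageUrl, { method: 'HEAD' }, function (res) {
    var length = parseInt(res.headers['content-length'], 10);
    res.resume(); // discard any body so the socket is released
    finish(null, isNaN(length) ? 0 : length);
  });
  req.on('error', function (err) { finish(err); });
  req.end();
}

// Pure helper: return the entry with the largest size, or null for an empty list.
function pickLargest(entries) {
  var best = null;
  entries.forEach(function (entry) {
    if (best === null || entry.size > best.size) best = entry;
  });
  return best;
}

// Check every URL, then hand the largest image to the callback.
function findMainImage(imageUrls, callback) {
  var entries = [];
  var pending = imageUrls.length;
  if (pending === 0) return callback(null, null);
  imageUrls.forEach(function (imageUrl) {
    headContentLength(imageUrl, function (err, size) {
      // Failed requests are skipped rather than aborting the whole scan.
      if (!err) entries.push({ url: imageUrl, size: size });
      if (--pending === 0) callback(null, pickLargest(entries));
    });
  });
}
```

`findMainImage` expects absolute URLs, so relative `src` values would need to be resolved against the page URL first.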

Answer 2

Use PhantomJS. It "is a headless WebKit with JavaScript API". Think of it as a whole browser you can control via a JavaScript API. Because it is a browser, it fully executes the pages it loads, and then you can scrape them.

It is somewhat similar to Node.js, but it is really a full browser in which your scripts have full access to the DOM of the page you have it pull down. So it is much easier to 'scrape' a page intelligently by accessing the DOM using something like jQuery, instead of just accessing raw HTML.

Here is an example of DOM manipulation.
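
Separately from the linked example, a minimal PhantomJS script for the question's case might look like the sketch below. It loads the page, waits a moment for scripts that inject images after load, and then reads every <img> source from the live DOM. The 1000 ms delay, the `collectImageUrls` helper name, and taking the URL from the command line are all assumptions made for illustration.

```javascript
// Runs inside the page when called via page.evaluate (no argument, so it
// falls back to the page's global document). The parameter exists only so the
// function can be exercised with a stand-in object outside PhantomJS.
function collectImageUrls(doc) {
  doc = doc || document;
  var imgs = doc.getElementsByTagName('img');
  var urls = [];
  for (var i = 0; i < imgs.length; i++) {
    if (imgs[i].src) urls.push(imgs[i].src);
  }
  return urls;
}

// Only drive the browser when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();
  var targetUrl = system.args[1]; // hypothetical usage: phantomjs images.js <url>
  var SETTLE_MS = 1000;           // assumed delay for late DOM injection

  page.open(targetUrl, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + targetUrl);
      phantom.exit(1);
      return;
    }
    // Give onload handlers time to insert their images before reading the DOM.
    window.setTimeout(function () {
      var urls = page.evaluate(collectImageUrls);
      console.log(JSON.stringify(urls));
      phantom.exit();
    }, SETTLE_MS);
  });
}
```

The printed JSON array can then be fed to whatever does the HEAD requests on the Node.js side. A fixed delay is the simplest option but not the most robust; polling until the image count stops changing would be an alternative.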
