
Scrape page after onload JS DOM injection

I'm building a scraper that picks out the main images on a page (currently judged by Content-Length). It goes through all <img> elements and makes a HEAD request for each one. But certain pages, especially mobile ones, insert their images with JavaScript after the page has loaded, so they are missing from the raw HTML. Any ideas on how to tackle this?

I'm using Node.js.

I can't be sure that it solves your problem, but you could look into using jsdom, as it can fetch and execute the scripts in a page and gives you a DOM on the server side. Something like:

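var request = require('request'),
    jsdom = require('jsdom').jsdom;

// `url` is the address of the page to scrape, defined elsewhere.
request(url, function(err, response, body) {
  if (err) return console.error(err);

  // In this (legacy) jsdom API, resource options go under `features`.
  var doc = jsdom(body, null, {
    features: {
      FetchExternalResources: ['script', 'img'],
      ProcessExternalResources: ['script'] // execute the fetched scripts
    }
  });
  var window = doc.createWindow();

  // Read the images only after the page's load handlers have run,
  // so that script-injected <img> elements are included.
  window.addEventListener('load', function() {
    var images = doc.getElementsByTagName('img');
    console.log(images.length + ' <img> elements found');
  });
});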
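
Once the <img> elements are available, the size check described in the question could be sketched roughly as below. This is only an illustration using Node's built-in http and https modules. The helper names (resolveImageUrls, headContentLength, rankBySize, findMainImages) are made up for this example, and redirects are not followed.

```javascript
var http = require('http');
var https = require('https');

// Resolve each src against the page URL, keeping only unique http(s) URLs.
// (Hypothetical helper; data: URIs and malformed values are dropped.)
function resolveImageUrls(srcs, pageUrl) {
  var seen = new Set();
  srcs.forEach(function (src) {
    if (!src) return;
    try {
      var resolved = new URL(src, pageUrl);
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        seen.add(resolved.href);
      }
    } catch (e) {
      // Malformed URL: skip it.
    }
  });
  return Array.from(seen);
}

// Send a HEAD request and resolve with the Content-Length in bytes.
// Resolves with 0 when the header is missing or the request fails.
function headContentLength(imageUrl) {
  return new Promise(function (resolve) {
    var client = imageUrl.indexOf('https:') === 0 ? https : http;
    var req = client.request(imageUrl, { method: 'HEAD' }, function (res) {
      res.resume(); // a HEAD response has no body, but release the socket
      var length = parseInt(res.headers['content-length'], 10);
      resolve(isNaN(length) ? 0 : length);
    });
    req.on('error', function () { resolve(0); });
    req.end();
  });
}

// Return a copy of the entries sorted from largest to smallest.
function rankBySize(entries) {
  return entries.slice().sort(function (a, b) { return b.size - a.size; });
}

// Resolves with [{ url, size }, ...], largest first.
function findMainImages(srcs, pageUrl) {
  var urls = resolveImageUrls(srcs, pageUrl);
  return Promise.all(urls.map(headContentLength)).then(function (sizes) {
    return rankBySize(urls.map(function (u, i) {
      return { url: u, size: sizes[i] };
    }));
  });
}
```

The src values can be collected from the DOM with something like Array.prototype.map.call(images, function (img) { return img.getAttribute('src'); }) and then passed to findMainImages together with the page URL.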

Use PhantomJS. It "is a headless WebKit with JavaScript API". Think of it as a whole browser that you control through a JavaScript API. Because it is a browser, it fully executes the page, scripts included, and then you can scrape the result.

It is somewhat similar to Node.js in that you write scripts in JavaScript, but it is really a full browser, and your scripts have full access to the DOM of the page you have it pull down. So it is much easier to scrape a page intelligently by accessing the DOM using something like jQuery, instead of just working with the raw HTML.
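
As a rough sketch of that idea, a PhantomJS script along these lines could load a page, give its scripts some time to inject images, and print the image URLs as JSON. The fixed 2-second wait and the helper name collectImageSrcs are assumptions made for this example. A real scraper would probably need a smarter way to decide when the page has settled.

```javascript
// Runs inside the loaded page: returns the src of every image in the DOM.
// (Hypothetical helper; it must not reference outside variables, because
// PhantomJS serializes the function before evaluating it in the page.)
function collectImageSrcs() {
  var srcs = [];
  for (var i = 0; i < document.images.length; i++) {
    if (document.images[i].src) {
      srcs.push(document.images[i].src);
    }
  }
  return srcs;
}

// Only run the scraping part when executed by PhantomJS itself.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();
  var pageUrl = system.args[1]; // usage: phantomjs images.js <url>

  page.open(pageUrl, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + pageUrl);
      phantom.exit(1);
      return;
    }
    // Assumption: 2 seconds is enough for onload scripts to add images.
    window.setTimeout(function () {
      var srcs = page.evaluate(collectImageSrcs);
      console.log(JSON.stringify(srcs));
      phantom.exit();
    }, 2000);
  });
}
```

The printed list can then be handed to the Node.js side, for example by spawning phantomjs as a child process and parsing its output.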

Here is an example of DOM manipulation.
