Scraping a page after onload JS DOM injection

Question

I'm building a scraper that gets the main images from a page (currently chosen by Content-Length). It goes through all <img> elements and makes a HEAD request for each one. But certain pages, especially mobile ones, have images inserted after page load. Any ideas on how to tackle this?

I'm using Node.js.
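For reference, the approach described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the helper names (`extractImageUrls`, `headContentLength`, `pickLargest`, `findMainImage`) are hypothetical, and the regex-based extraction is a simplification that assumes quoted `src` attributes. It only sees images present in the static HTML, which is exactly the limitation the question is about.

```javascript
const http = require('http');
const https = require('https');

// Pull the src of every <img> tag out of raw HTML and resolve it against
// the page URL. A regex is a simplification; a real HTML parser is more robust.
function extractImageUrls(html, pageUrl) {
  const urls = [];
  const re = /<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    try {
      urls.push(new URL(match[1], pageUrl).href);
    } catch (e) {
      // Skip values that cannot be resolved to a URL.
    }
  }
  return urls;
}

// Issue a HEAD request and resolve with the Content-Length (0 if absent).
function headContentLength(url) {
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    const req = client.request(url, { method: 'HEAD' }, (res) => {
      res.resume(); // discard any body so the socket is released
      resolve(parseInt(res.headers['content-length'], 10) || 0);
    });
    req.on('error', reject);
    req.end();
  });
}

// Return the entry with the largest size, or null for an empty list.
function pickLargest(entries) {
  return entries.reduce(
    (best, e) => (best === null || e.size > best.size ? e : best),
    null
  );
}

// Combine the steps: extract URLs, size each via HEAD, keep the largest.
async function findMainImage(html, pageUrl) {
  const urls = extractImageUrls(html, pageUrl);
  const entries = await Promise.all(
    urls.map(async (url) => ({ url, size: await headContentLength(url) }))
  );
  return pickLargest(entries);
}
```

Calling `findMainImage(html, pageUrl)` returns a promise for the `{ url, size }` entry with the largest reported Content-Length, or `null` if the HTML contains no images.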

Answer 1

I can't be sure that it solves your problem, but you could look into using jsdom, as it can fetch and execute the scripts in a page and gives you a DOM on the server side. Something like:

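var request = require('request'),
    jsdom = require('jsdom').jsdom;

// `url` is the address of the page to scrape.
request(url, function(err, response, body) {
  if (err) return console.error(err);

  // Build a server-side DOM from the fetched HTML. The option is meant to
  // make jsdom fetch the page's external scripts and images.
  var doc = jsdom(body, null, {
    FetchExternalResources: ['script', 'img']
  });
  var window = doc.createWindow();

  // Collect the <img> elements present in the DOM at this point.
  var images = doc.getElementsByTagName('img');
});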

Answer 2

Use PhantomJS. It "is a headless WebKit with JavaScript API". Think of it as a whole browser you can control via a JavaScript API. Because it is a browser, it will fully execute the pages, and then you can scrape them.

It is somewhat similar to Node.js, but it is really a full browser in which your scripts have full access to the DOM of the page you have it pull down. So it is much easier to "scrape" a page intelligently by accessing the DOM using something like jQuery, instead of just working with the raw HTML.

Here is an example of DOM manipulation.
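As a rough sketch of this approach (separate from the example referenced above), a PhantomJS script might open the page, wait briefly for onload scripts to inject their images, and then read the `src` of every `<img>` from the live DOM. The URL and the 2-second delay are assumptions chosen for illustration, and `collectImageSources` is a hypothetical helper name. The script is written in ES5 and is meant to be run with the `phantomjs` binary; the guard keeps the page-driving part from running in other environments.

```javascript
// Runs inside the page (browser context) when passed to page.evaluate,
// so it may only rely on page globals such as `document`.
function collectImageSources() {
  var imgs = document.getElementsByTagName('img');
  var sources = [];
  for (var i = 0; i < imgs.length; i++) {
    if (imgs[i].src) {
      sources.push(imgs[i].src);
    }
  }
  return sources;
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://example.com/'; // hypothetical target page

  page.open(url, function (status) {
    if (status !== 'success') {
      console.error('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    // Assumption: 2 seconds is enough for onload scripts to insert images.
    setTimeout(function () {
      var sources = page.evaluate(collectImageSources);
      console.log(sources.join('\n'));
      phantom.exit(0);
    }, 2000);
  });
}
```

The printed URLs could then be fed to the same HEAD-request and Content-Length comparison described in the question.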
