Scrape page after onload JS DOM injection
I'm building a scraper that gets the main images from a page (currently chosen by Content-Length). It goes through all <img> elements and makes a HEAD request for each one. But certain pages, especially mobile ones, have images inserted after page load. Any ideas on how to tackle this?
I'm using node.js.
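For context, the HEAD-request step described above might look roughly like the sketch below. It uses only Node's built-in http and https modules. The helper names (headContentLength, pickLargest, findMainImage) are made up for illustration, and the sketch assumes absolute image URLs and does not follow redirects.

```javascript
const http = require('http');
const https = require('https');

// Issue a HEAD request and resolve with the reported Content-Length in bytes,
// or null if the header is missing or the request fails.
function headContentLength(url) {
    return new Promise((resolve) => {
        const client = url.startsWith('https:') ? https : http;
        const req = client.request(url, { method: 'HEAD' }, (res) => {
            res.resume(); // discard any body so the socket is released
            const bytes = Number.parseInt(res.headers['content-length'], 10);
            resolve(Number.isNaN(bytes) ? null : bytes);
        });
        req.on('error', () => resolve(null));
        req.end();
    });
}

// Return the entry with the largest known size, or null if none is known.
// (Hypothetical helper; ties keep the earlier entry.)
function pickLargest(entries) {
    return entries
        .filter((e) => e.bytes !== null)
        .reduce((best, e) => (best === null || e.bytes > best.bytes ? e : best), null);
}

// Probe every URL in parallel and pick the biggest image.
async function findMainImage(urls) {
    const entries = await Promise.all(
        urls.map(async (url) => ({ url, bytes: await headContentLength(url) }))
    );
    return pickLargest(entries);
}
```

This only works for images that are already in the HTML, which is exactly the limitation the question is about: the list of URLs has to come from a DOM that has had its scripts run.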
I can't be sure that it solves your problem, but you could look into using jsdom, as it can fetch and execute the scripts in a page and gives you a DOM on the server side. Something like:
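    // Note: this uses the legacy jsdom API (jsdom(html, level, options) and
    // createWindow()); `url` is the address of the page to scrape.
    var request = require('request'),
        jsdom = require('jsdom').jsdom;

    request(url, function(err, response, body) {
        if (err) return console.error(err);
        // Parse the HTML and let jsdom fetch external scripts and images.
        var doc = jsdom(body, null, {
            features: {
                FetchExternalResources: ['script', 'img']
            }
        });
        var window = doc.createWindow();
        // Live collection of <img> elements in the server-side DOM.
        var images = doc.getElementsByTagName('img');
    });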
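Later jsdom releases replaced that interface with a JSDOM class. A rough equivalent under the newer API might look like the sketch below. It assumes the third-party jsdom package is installed (npm install jsdom); the function names imageUrls and scrapeImages are hypothetical. Note that runScripts: 'dangerously' executes the page's scripts inside your Node process, which the jsdom documentation warns against for untrusted content.

```javascript
// Collect the non-empty src values of all <img> elements in a document.
function imageUrls(document) {
    return Array.from(document.getElementsByTagName('img'))
        .map((img) => img.src)
        .filter(Boolean);
}

// Load a page, let its scripts run, then read the <img> elements.
async function scrapeImages(url) {
    const { JSDOM } = require('jsdom'); // third-party dependency (assumed installed)
    const dom = await JSDOM.fromURL(url, {
        runScripts: 'dangerously', // execute inline and external scripts
        resources: 'usable',       // allow external scripts to be fetched
    });
    const { window } = dom;
    // Wait for the load event unless the document has already finished loading.
    if (window.document.readyState !== 'complete') {
        await new Promise((resolve) => window.addEventListener('load', resolve));
    }
    const urls = imageUrls(window.document);
    window.close(); // stop timers and release the window
    return urls;
}
```

Images that a script inserts some time after the load event (for example from a timer or an XHR callback) would still be missed here, so an extra delay before reading the DOM may be needed.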
Use PhantomJS. It "is a headless WebKit with JavaScript API". Think of it as a whole browser you can control via a JavaScript API. Because it is a browser, it fully executes the page, and then you can scrape the result.
It is somewhat similar to Node.js, but it is really a full browser in which your scripts have full access to the DOM of the page you load. That makes it much easier to scrape a page intelligently by querying the DOM with something like jQuery, instead of just parsing raw HTML.
Here is an example on DOM manipulation
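For image scraping specifically, a minimal PhantomJS script might look like the sketch below. It is meant to be run as phantomjs scrape.js <url> (the file name is arbitrary), not under Node.js. The 2000 ms settle delay is an assumption to give scripts time to inject images after load, and collectImageSrcs is a hypothetical helper name.

```javascript
// Executed inside the page by page.evaluate(); it sees the live DOM,
// including <img> elements that scripts added after the initial HTML.
function collectImageSrcs() {
    var srcs = [];
    var imgs = document.getElementsByTagName('img');
    for (var i = 0; i < imgs.length; i++) {
        if (imgs[i].src) srcs.push(imgs[i].src);
    }
    return srcs;
}

// The `phantom` global only exists under PhantomJS; the guard keeps the
// file harmless if it is loaded in another JavaScript runtime.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();
    var url = system.args[1];  // first command-line argument
    var SETTLE_MS = 2000;      // assumed delay for late-injected images

    page.open(url, function (status) {
        if (status !== 'success') {
            console.error('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        // Give onload handlers time to insert their images, then read the DOM.
        setTimeout(function () {
            var srcs = page.evaluate(collectImageSrcs);
            console.log(srcs.join('\n'));
            phantom.exit(0);
        }, SETTLE_MS);
    });
}
```

The printed URLs can then be fed to the existing Node.js scraper, which can keep making its HEAD requests as before.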