
Scrape data from a site that uses a browser-based template engine

I am trying to scrape data from a page that is rendered in the browser by a JavaScript-heavy template engine. When I load it with jsdom I can't get any data, perhaps because the page doesn't have enough time to load or render. How should I scrape data in this case: wait with a timer, or download the whole page with a plain HTTP request?

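var jsdom = require('jsdom');

jsdom.env({
  url: link, // URL of the page to scrape
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    // Stop early if the page or the injected script failed to load
    if (errors) {
      console.error(errors);
      return;
    }
    var $ = window.$;
    var date = $('.date').text();
    console.log(date);
  }
});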

A colleague of mine has a PhantomJS-based project that does just that: https://github.com/vmeurisse/phantomCrawl

He has a simple example that looks a lot like your snippet:

'use strict';

var PhantomCrawl = require('./src/PhantomCrawl');

var urls = [];

urls.push('http://www.bing.com');
var ptc = new PhantomCrawl({
    urls: urls,
    nbThreads: 4,
    crawlerPerThread: 4,
    maxDepth: 1
});

urls is the list of urls to crawl.

nbThreads is the number of instances of PhantomJS launched.

crawlerPerThread is the number of pages crawled in parallel per instance of PhantomJS.

maxDepth is the maximum number of link hops the crawler follows from each starting page.
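To make the meaning of maxDepth concrete, here is a minimal sketch of a depth-limited crawl over an in-memory link graph. It is not phantomCrawl's actual implementation: the `crawl` function, the `linkGraph` data, and the `linksOf` helper are all hypothetical names made up for illustration, and no pages are really fetched.

```javascript
'use strict';

// Hypothetical link graph standing in for real pages (assumption, not real data).
var linkGraph = {
  a: ['b', 'c'],
  b: ['d'],
  c: ['a'],
  d: ['e'],
  e: []
};

// Hypothetical helper: returns the links found on a "page".
function linksOf(url) {
  return linkGraph[url] || [];
}

// Breadth-first walk that follows links at most maxDepth hops
// away from each start URL and visits every URL only once.
function crawl(startUrls, maxDepth, getLinks) {
  var seen = Object.create(null);
  var queue = [];
  var visited = [];

  startUrls.forEach(function (url) {
    if (!seen[url]) {
      seen[url] = true;
      queue.push({ url: url, depth: 0 });
    }
  });

  while (queue.length > 0) {
    var item = queue.shift();
    visited.push(item.url);

    // Pages already at the depth limit are visited, but their links are not followed.
    if (item.depth >= maxDepth) {
      continue;
    }

    getLinks(item.url).forEach(function (next) {
      if (!seen[next]) {
        seen[next] = true;
        queue.push({ url: next, depth: item.depth + 1 });
      }
    });
  }

  return visited;
}

console.log(crawl(['a'], 1, linksOf));
```

Under this reading, maxDepth: 0 would visit only the start URLs, and maxDepth: 1 would also visit the pages they link to directly.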

