
Options for article scraping from many different websites

I need to add webpage scraping functionality to a single page application.

I need to retrieve useful content from many different blogs and services. By useful content, I mean articles, texts and links to videos in order to embed them on my pages.

This tool seems to offer what I need: http://www.diffbot.com/

Using it, I can simply input an article's URL and this service will retrieve all data that I need from that single page.
However, I do not need to handle 250 thousand requests per month, which would cost $300 each month; I need a solution that handles about 5,000 requests per month, with the possibility of scaling later.

I've found a lot of scraping solutions through Google, but they mostly offer solutions which scrape custom content periodically from a small number of websites - this is not what I need. Also, I do not have experience in this area, so I would like you to advise me on what I should use for this purpose. I am primarily dealing with JavaScript.

In addition, is it at all possible to allow pages to be scraped by the client's browser, rather than server-side?

I am developing the SPA with ReactJS and the Flux architecture. The server runs Node.js with Express, and the database is Backendless.

It sounds like a custom solution built on Node.js will be your best bet, given the JavaScript requirement. There are several Node modules you could use to accomplish this. I would recommend the following:

request - Used to grab the HTML from the target webpage

cheerio - Used to filter the HTML gathered by request

node-horseman - Used to execute JavaScript on the target webpage (for more advanced scraping)

artoo - Client-side scraping library (I've never used this, but it may be what you are looking for)

As for the SPA development, I would recommend sailsjs.

Here is an example Node app that scrapes https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515 using two of the above modules.

Request & Cheerio:

var cheerio = require('cheerio'),
    request = require('request');

//Define your target URL and HTTP method here
var options = {
  url: 'https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515',
  method: 'GET'
};

// Use request to grab the HTML defined in options and return the contents in "body"
request(options, function (err, res, body) {
  if (!err && res.statusCode === 200) {

    // Load the "body" with cheerio
    var $ = cheerio.load(body);

    // grab each occurrence of the matched html (Use Chrome developer tools to determine CSS) using Cheerio
    $('tbody').children().each(function(i, element){
      var $element = $(element);
      var name = $element.children().eq(0).text().trim();
      var salary = $element.children().eq(3).text().trim();

      // Put the filtered data in an object
      var post = {
        name: name,
        salary: salary
      }
      // Print the object to the console
      console.log(post);
    });
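  } else {
    // Report network errors and non-200 responses instead of failing silently
    console.error('Request failed:', err || res.statusCode);
  }
});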
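
The example above only reads text cells. For the part of the question about collecting links to videos, attribute values read with cheerio (for instance via `.attr('src')` or `.attr('href')`) are often relative paths, so they usually need to be resolved against the page URL before they can be embedded elsewhere. Below is a minimal sketch of that step using only Node's built-in `URL` class. The helper name `normalizeLinks` and the sample values are made up for illustration; they are not part of request or cheerio.

```javascript
// Hypothetical helper (not part of cheerio or request): turns scraped
// href/src values into absolute, de-duplicated http(s) URLs.
function normalizeLinks(pageUrl, rawLinks) {
  const seen = new Set();
  const result = [];

  for (const raw of rawLinks) {
    // Skip missing attributes and empty strings
    if (typeof raw !== 'string' || raw.trim() === '') continue;

    let absolute;
    try {
      // The URL constructor resolves relative paths against pageUrl
      absolute = new URL(raw.trim(), pageUrl);
    } catch (e) {
      continue; // not parseable as a URL
    }

    // Drop javascript:, mailto:, data: and similar schemes
    if (absolute.protocol !== 'http:' && absolute.protocol !== 'https:') continue;

    // Keep only the first occurrence of each absolute URL
    if (!seen.has(absolute.href)) {
      seen.add(absolute.href);
      result.push(absolute.href);
    }
  }
  return result;
}

// Example with made-up values
console.log(normalizeLinks('https://example.com/blog/post', [
  '/videos/1', 'clip.mp4', 'https://youtu.be/abc', 'javascript:void(0)', '/videos/1'
]));
```

Inside the `.each()` callback you could push each attribute value into an array and pass that array, together with `options.url`, to a helper like this one once the loop finishes.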


 