I need to add webpage scraping functionality to a single page application.
I need to retrieve useful content from many different blogs and services. By useful content, I mean articles, texts and links to videos in order to embed them on my pages.
This tool seems to offer what I need: http://www.diffbot.com/
Using it, I can simply input an article's URL and this service will retrieve all data that I need from that single page.
However, I do not need to handle 250 thousand requests per month, which would cost $300 each month; I need a solution that handles about 5000 requests per month, with the possibility of scaling later.
I've found a lot of scraping solutions through Google, but they mostly offer solutions which scrape custom content periodically from a small number of websites - this is not what I need. Also, I do not have experience in this area, so I would like you to advise me on what I should use for this purpose. I am primarily dealing with JavaScript.
In addition, is it at all possible to allow pages to be scraped by the client's browser, rather than server-side?
I am developing the SPA with ReactJS and the Flux architecture. The server is NodeJS + Express, and the database is Backendless.
It sounds like a custom solution, perhaps built on Node.js, will be your best bet (taking the JavaScript requirement into consideration). There are several Node modules you could use to accomplish this. I would recommend the following:
request - Used to grab the HTML from the target webpage
cheerio - Used to filter the HTML gathered by request
node-horseman - Used to execute JavaScript on the target webpage (for more advanced scraping)
artoo - Client-side scraping library (I've never used this, but it may be what you are looking for)
As for the SPA development, I would recommend sailsjs.
Here is an example Node app using two of the above modules, request and cheerio, to scrape https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515
Request & Cheerio:
var cheerio = require('cheerio'),
    request = require('request');

// Define your target URL and HTTP method here
var options = {
    url: 'https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515',
    method: 'GET'
};

// Use request to grab the HTML defined in options and return the contents in "body"
request(options, function (err, res, body) {
    if (!err && res.statusCode === 200) {
        // Load the "body" with cheerio
        var $ = cheerio.load(body);

        // Grab each occurrence of the matched HTML using cheerio
        // (use Chrome developer tools to determine the CSS selector)
        $('tbody').children().each(function (i, element) {
            var $element = $(element);
            var name = $element.children().eq(0).text().trim();
            var salary = $element.children().eq(3).text().trim();

            // Put the filtered data in an object
            var post = {
                name: name,
                salary: salary
            };

            // Print the object to the console
            console.log(post);
        });
    }
});
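Since the question also asks for links to videos to embed, here is a rough sketch of one way to post-process the raw href/src values that cheerio collects. It is only an illustration under some assumptions: the function name pickVideoLinks and the VIDEO_HOSTS list are hypothetical (they are not part of request or cheerio), and the list of hosts would need to be adjusted to the services you actually want to embed. It uses only the URL class built into Node.

```javascript
// Hypothetical helper, not part of request or cheerio.
// Assumption: these are the video hosts you care about; extend as needed.
var VIDEO_HOSTS = ['youtube.com', 'youtu.be', 'vimeo.com'];

// Resolve raw href/src values against the page URL and keep only
// unique http(s) links whose host is one of the known video services.
function pickVideoLinks(rawLinks, pageUrl) {
    var seen = new Set();
    var result = [];

    rawLinks.forEach(function (raw) {
        var url;
        try {
            // Resolves relative and protocol-relative links against pageUrl
            url = new URL(raw, pageUrl);
        } catch (e) {
            return; // Skip values that cannot be parsed as a URL
        }

        // Ignore mailto:, javascript: and other non-web schemes
        if (url.protocol !== 'http:' && url.protocol !== 'https:') {
            return;
        }

        // Match the host itself or any of its subdomains
        var host = url.hostname.replace(/^www\./, '');
        var isVideo = VIDEO_HOSTS.some(function (h) {
            return host === h || host.endsWith('.' + h);
        });

        if (isVideo && !seen.has(url.href)) {
            seen.add(url.href);
            result.push(url.href);
        }
    });

    return result;
}
```

Inside the request callback above, the raw values could be gathered with something like $('a, iframe').map(function (i, el) { return $(el).attr('href') || $(el).attr('src'); }).get() and then passed to pickVideoLinks together with options.url.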