
Scraping an authenticated website in Node.js

I want to scrape my college website (Moodle) with Node.js, but I haven't found a headless browser that can do it. I did it in Python in just 10 lines of code using RoboBrowser:

from robobrowser import RoboBrowser
url = "https://cas.upc.edu/login?service=https%3A%2F%2Fatenea.upc.edu%2Fmoodle%2Flogin%2Findex.php%3FauthCAS%3DCAS"
browser = RoboBrowser()
browser.open(url)
form = browser.get_form()
form['username'] = 'myUserName'
form['password'] = 'myPassword'
browser.submit_form(form)
browser.open("http://atenea.upc.edu/moodle/")
print(browser.parsed)

The problem is that the website requires authentication. Can you help me? Thanks!

PS: I think this could be useful: https://www.npmjs.com/package/form-scraper but I can't get it working.

Assuming you want to read a third-party website and scrape particular pieces of information from it, you could use a library such as cheerio to achieve this in Node.

Cheerio is a "lean implementation of core jQuery designed specifically for the server". This means that, given a string representation of a DOM (or part thereof), cheerio can traverse it in much the same way jQuery can.

An example from Max Ogden shows how you can use the request module to grab HTML from a remote server and then pass it to cheerio:

var $ = require('cheerio')
var request = require('request')

var domain = 'http://substack.net/images/'

function gotHTML(err, resp, html) {
  if (err) return console.error(err)
  var parsedHTML = $.load(html)
  // collect the URLs of all links that point at .png images
  var imageURLs = []
  parsedHTML('a').each(function(i, link) {
    var href = parsedHTML(link).attr('href')
    if (!href || !/\.png$/.test(href)) return
    imageURLs.push(domain + href)
  })
  console.log(imageURLs)
}

request(domain, gotHTML)

Selenium supports multiple languages, platforms, and browsers.
