
Scraping ASU's course catalog with Node.js

I'm pretty new to Node.js, so apologies in advance if I don't know what I'm talking about.

I'm trying to scrape some courses off ASU's course catalog (https://webapp4.asu.edu/catalog/) and have made numerous attempts using Zombie, Node.IO, and the https module. In every case I've run into a redirect loop.

I'm wondering if it's because I'm not setting my headers properly?

Below is a sample of the code I used (not Zombie/Node.IO):

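var https = require('https');

var options = {
  host: 'webapp4.asu.edu',
  path: '/catalog',
  method: 'GET',
  headers: {
    // Cookies are sent to a server in the "Cookie" request header;
    // "Set-Cookie" is the header a server uses in its response.
    'Cookie': 'onlineCampusSelection=C'
  }
};

var req = https.request(options, function(res) {
  console.log("statusCode: ", res.statusCode);
  console.log("headers: ", res.headers);
  res.on('data', function(d) {
    process.stdout.write(d);
  });
});

// Report connection-level failures instead of crashing on them.
req.on('error', function(e) {
  console.error(e);
});

// The request is not sent until end() is called.
req.end();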

Just to clarify, I'm not having trouble scraping with Node.js in general; it's specifically ASU's course catalog that is giving me trouble.

Appreciate any ideas you guys could give me, thanks!

Update: My request successfully went through if I create a cookie with a JSESSIONID I got from Chrome/FF. Is there a way for me to request/create a JSESSIONID?

I'd highly recommend using jsdom in conjunction with jQuery (for Node). I've used it many times for scraping, as it makes things very easy.

Here's the example from jsdom's README:

// Count all of the links from the nodejs build page
var jsdom = require("jsdom");

jsdom.env("http://nodejs.org/dist/", [
  'http://code.jquery.com/jquery-1.5.min.js'
],
function(errors, window) {
  console.log("there have been", window.$("a").length, "nodejs releases!");
});

Hope that helps, jsdom has made it real easy to hack together scraping experiments (for me at least).

It looks like the server sets the JSESSIONID cookie and then redirects away, so you need to tell Node.js not to follow redirects if you want to grab the cookie. I don't know how to do this with the http or https packages, but there is another package you can get via npm, request, which lets you do it. Here's a sample that should get you started:

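var request = require("request");

var options = {
  url: "https://webapp4.asu.edu/catalog/",
  // Option names are case-sensitive: it is followRedirect, not followredirect.
  followRedirect: false
};

request.get(options, function(error, response, body) {
  if (error) {
    return console.error(error);
  }
  // The session cookie arrives in the Set-Cookie header of the redirect response.
  console.log(response.headers['set-cookie']);
});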

Output should look something like this:

[ 'JSESSIONID=B43CC3BB09FFCDE07AE6B3B702717431.catalog1; Path=/catalog; Secure' ]
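To reuse that value, the Set-Cookie entries have to be reduced to plain name=value pairs and sent back in a Cookie request header. The following is only a rough sketch of that step using the core https module, which does not follow redirects on its own. The names toCookieHeader and fetchCatalogWithSession are made up for illustration, and whether the catalog accepts a session obtained this way (or needs further cookies such as onlineCampusSelection) is an assumption that has not been verified here.

```javascript
var https = require('https');

// Reduce a Set-Cookie array to a Cookie request header value:
// keep only the leading "name=value" of each entry and drop
// attributes such as Path or Secure.
function toCookieHeader(setCookie) {
  return (setCookie || []).map(function (entry) {
    return entry.split(';')[0].trim();
  }).join('; ');
}

// Hypothetical helper: request the catalog once to collect the session
// cookie, then request it again with that cookie attached.
function fetchCatalogWithSession(callback) {
  var options = { host: 'webapp4.asu.edu', path: '/catalog/' };
  https.get(options, function (first) {
    first.resume(); // discard the body; only the headers matter here
    options.headers = { Cookie: toCookieHeader(first.headers['set-cookie']) };
    https.get(options, function (second) {
      var body = '';
      second.setEncoding('utf8');
      second.on('data', function (chunk) { body += chunk; });
      second.on('end', function () {
        callback(null, second.statusCode, body);
      });
    }).on('error', callback);
  }).on('error', callback);
}
```

Calling fetchCatalogWithSession(function(err, status, body) { ... }) would then hand you the status code and HTML of the second response, assuming the server honours the cookie.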

