Help scraping HTML with JSoup

I'm a bit of a beginner, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I'm having trouble with the initial step of scraping the data from the site.

I just added the Jsoup library to my project in Eclipse, and am now having trouble initializing the connection, even though I am following the Jsoup documentation.

In the end, my goal is to grab each class name / time / description, but for now I want to just grab the name. The HTML of the source website appears like this:

<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')

My first guess was to call getElementsByTag("td"), and then query these elements for the onclick attribute or the value of the class attribute, cleaning the latter up by removing the leading "I" and the trailing " SW", leaving behind the name "CS3330".
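The clean-up step described above could look roughly like the sketch below. It is plain Java with no Jsoup calls, CourseCodeCleaner and clean are hypothetical names, and it assumes every value has the same shape as the "ICS3330 SW" sample (a leading "I" and a trailing " SW").

```java
// Hypothetical helper, not part of Jsoup: turns a class value such as
// "ICS3330 SW" into the course code "CS3330".
class CourseCodeCleaner {

    // Assumption: values look like the sample in the question, i.e. an
    // "I" prefix and an " SW" suffix. Anything else is returned trimmed.
    static String clean(String classValue) {
        String code = classValue.trim();
        if (code.endsWith(" SW")) {
            // Drop the trailing " SW" marker.
            code = code.substring(0, code.length() - " SW".length());
        }
        if (code.startsWith("I")) {
            // Drop the single leading "I".
            code = code.substring(1);
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(clean("ICS3330 SW"));
    }
}
```

Stripping the suffix before the prefix keeps the two checks independent, so a value that is missing one of the two markers still comes back unchanged in that respect.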

Now onto the actual implementation:

Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");

At this point, I am already running into problems (even though I am not straying far from the examples provided in the documentation) and would appreciate some guidance on getting my code to function!

edit: GOT IT! Thank you all!

According to the documentation, you should be doing:

Document doc = Jsoup.connect(url).get();

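The parse(String, String) method is for HTML you already have in hand: it treats its first argument ("UTF-8" in your call) as the HTML to parse and the second as the base URI, so it never fetches the page, and the Document it returns has no get() method.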

I just downloaded JSoup and tried it out on your school's website and got this output:

Unit: Computer Science
   CS 1010: Introduction to Information Technology
   CS 1110: Introduction to Programming
   CS 1111: Introduction to Programming
   CS 1112: Introduction to Programming
   CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
   CS 2102: Discrete Mathematics I
   CS 2110: Software Development Methods
   CS 2150: Program and Data Representation
   CS 2220: Engineering Software
   CS 2330: Digital Logic Design
   CS 2501: Special Topics in Computer Science
   CS 3102: Theory of Computation
   CS 3330: Computer Architecture
   CS 4102: Algorithms
   CS 4240: Principles of Software Design
   CS 4414: Operating Systems
   CS 4444: Introduction to Parallel Computing
   CS 4457: Computer Networks
   CS 4501: Special Topics in Computer Science
   CS 4753: Electronic Commerce Technologies
   CS 4810: Introduction to Computer Graphics
   CS 4993: Independent Study
   CS 4998: Distinguished BA Majors Research
   CS 6161: Design and Analysis of Algorithms
   CS 6190: Computer Science Perspectives
   CS 6354: Computer Architecture
   CS 6444: Introduction to Parallel Computing
   CS 6501: Special Topics in Computer Science
   CS 6610: Programming Languages
   CS 7457: Computer Networks
   CS 7993: Independent Study
   CS 7995: Supervised Project Research
   CS 8501: Special Topics in Computer Science
   CS 8524: Topics in Software Engineering
   CS 8897: Graduate Teaching Instruction
   CS 8999: Thesis
   CS 9999: Dissertation

Too flippin' cool! Vlad is right though; use the connect(...) method. +1 to Vlad

Other suggestions and hints:
These are the constants that I used in my little program:

   private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
        "page.php?Semester=1118&Type=Group&Group=CompSci";
   private static final String TD_TAG = "td";
   private static final String CLASS_ATTRIB = "class";
   private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
   private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
   private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

And these are the variables I used inside the scraping method:

     String unitName = "";
     List<String> courseNumbNameList = new ArrayList<String>();
     String courseNumbName = "";

Edit 1
Based on your recent comments, I think that you're over-thinking things a bit. What worked well for me is this simple algorithm:

  • Create the 3 variables I have listed above
  • Get my document as Vlad recommends.
  • Create a td Elements variable and assign to it all elements that have a td tag.
  • Use a for loop with int i going from 0 to < td.size(), and get each Element (call it element) using td.get(i).
  • Inside the loop check the element's class attribute.
  • If the attribute String equals the CLASS_ATTRIB_UNIT_NAME String (see above), get the element's text and use it to set the unitName variable.
  • If the attribute String equals CLASS_ATTRIB_COURSE_NUM set the courseNumbName to the element's text.
  • If the attribute String equals CLASS_ATTRIB_COURSE_NAME, append the element's text to the courseNumbName String, add the String to the array list, and reset courseNumbName to "".
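The loop described in the list above can be sketched as follows. To keep it runnable without the library or a network connection, it walks a list of hypothetical Cell objects that stand in for Jsoup's Element; with Jsoup you would read the same two values with element.attr("class") and element.text(). The class name CourseScan, the Cell type, the ": " separator (inferred from the printed output) and the sample rows are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CourseScan {
    static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
    static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
    static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

    // Hypothetical stand-in for a Jsoup Element: just the class attribute and the text.
    static final class Cell {
        final String cssClass;
        final String text;

        Cell(String cssClass, String text) {
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    String unitName = "";
    final List<String> courseNumbNameList = new ArrayList<String>();

    void scan(List<Cell> td) {
        String courseNumbName = "";
        for (int i = 0; i < td.size(); i++) {
            Cell element = td.get(i);
            String classAttrib = element.cssClass; // with Jsoup: element.attr("class")
            if (CLASS_ATTRIB_UNIT_NAME.equals(classAttrib)) {
                unitName = element.text;           // with Jsoup: element.text()
            } else if (CLASS_ATTRIB_COURSE_NUM.equals(classAttrib)) {
                courseNumbName = element.text;     // start a new "number: name" entry
            } else if (CLASS_ATTRIB_COURSE_NAME.equals(classAttrib)) {
                courseNumbName += ": " + element.text;
                courseNumbNameList.add(courseNumbName);
                courseNumbName = "";               // reset for the next course
            }
            // Cells with any other class are ignored.
        }
    }

    public static void main(String[] args) {
        // Sample rows modeled on the output shown earlier; the real cell text may differ.
        List<Cell> td = Arrays.asList(
                new Cell("UnitName", "Computer Science"),
                new Cell("CourseNum", "CS 1010"),
                new Cell("CourseName", "Introduction to Information Technology"),
                new Cell("CourseNum", "CS 1110"),
                new Cell("CourseName", "Introduction to Programming"));
        CourseScan scan = new CourseScan();
        scan.scan(td);
        System.out.println("Unit: " + scan.unitName);
        for (String course : scan.courseNumbNameList) {
            System.out.println("   " + course);
        }
    }
}
```

This relies on each CourseNum cell being followed by its CourseName cell in document order, which is what the algorithm above also assumes.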
