Help scraping HTML with JSoup
I'm a bit of a beginner here, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I am having trouble with the initial step of scraping the data from the site.
I just added the JSoup library to my project in Eclipse, and I am now having trouble initializing the connection while following the Jsoup documentation.
In the end, my goal is to grab each class name / time / description, but for now I just want to grab the name. The HTML of the source website looks like this:
<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')
My first guess was to call getElementsByTag("td") and then query these elements for the value of the onclick attribute or the 'class' attribute, cleaning it up by removing the initial "I" and the " SW" suffix, leaving behind the name "CS3330".
Now onto the actual implementation:
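The clean-up step described above can be sketched in plain Java, independent of Jsoup. This is only an illustration: the class name `CourseIdCleaner` and both helper methods are made up for this example, and it assumes every attribute value follows the same pattern as the single `ICS3330 SW` / `toggledetails('CS3330')` sample shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CourseIdCleaner {
    // Captures the quoted argument of toggledetails('...') in an onclick value.
    private static final Pattern TOGGLE_ARG =
            Pattern.compile("toggledetails\\('([^']+)'\\)");

    // "ICS3330 SW" -> "CS3330": drop the leading "I" and the trailing " SW".
    static String fromClassAttribute(String classValue) {
        String s = classValue.trim();
        if (s.startsWith("I")) {
            s = s.substring(1);
        }
        if (s.endsWith(" SW")) {
            s = s.substring(0, s.length() - " SW".length());
        }
        return s;
    }

    // Returns the course id found in the onclick value, or null if none matches.
    static String fromOnclick(String onclickValue) {
        Matcher m = TOGGLE_ARG.matcher(onclickValue);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromClassAttribute("ICS3330 SW"));
        System.out.println(fromOnclick("toggledetails('CS3330')"));
    }
}
```

Reading the id out of the onclick value needs no trimming rules, so it may be the less fragile of the two options if the class values turn out to vary.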
Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");
At this point, I am already running into problems (even though I am not straying far from the examples provided in the documentation) and would appreciate some guidance on getting my code to function!
Edit: GOT IT! Thank you all!
According to the documentation, you should be doing:
Document doc = Jsoup.connect(url).get();
The parse() method is for content you already have, such as a file; it does not fetch a page from a URL string.
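To make the difference concrete, here is a small sketch that uses `Jsoup.connect(url).get()` for the live page and `Jsoup.parse(String)` for markup already in memory. The class name `FetchVsParse`, the helper `courseNumCells`, and the inline sample HTML are assumptions made for this example; the `CourseNum` class name comes from the snippet in the question, but what text those cells hold on the real page has not been verified here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FetchVsParse {
    // Collects the text of every <td> that carries the CSS class "CourseNum".
    static List<String> courseNumCells(Document doc) {
        List<String> cells = new ArrayList<>();
        Elements tds = doc.getElementsByTag("td");
        for (Element td : tds) {
            if (td.hasClass("CourseNum")) {
                cells.add(td.text());
            }
        }
        return cells;
    }

    public static void main(String[] args) throws IOException {
        // connect(...).get() performs the HTTP request and parses the response.
        String url = "http://rabi.phys.virginia.edu/mySIS/CS2/"
                + "page.php?Semester=1118&Type=Group&Group=CompSci";
        Document fetched = Jsoup.connect(url).get();
        System.out.println(courseNumCells(fetched));

        // parse(String) only parses markup that is already in memory.
        Document inMemory = Jsoup.parse(
                "<table><tr><td class='CourseNum'>CS 3330</td></tr></table>");
        System.out.println(courseNumCells(inMemory));
    }
}
```

Keeping the extraction in a method that takes a `Document` means the same code can be tried against a saved copy of the page without hitting the network.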
I just downloaded JSoup and tried it out on your school's website and got this output:
Unit: Computer Science
CS 1010: Introduction to Information Technology
CS 1110: Introduction to Programming
CS 1111: Introduction to Programming
CS 1112: Introduction to Programming
CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
CS 2102: Discrete Mathematics I
CS 2110: Software Development Methods
CS 2150: Program and Data Representation
CS 2220: Engineering Software
CS 2330: Digital Logic Design
CS 2501: Special Topics in Computer Science
CS 3102: Theory of Computation
CS 3330: Computer Architecture
CS 4102: Algorithms
CS 4240: Principles of Software Design
CS 4414: Operating Systems
CS 4444: Introduction to Parallel Computing
CS 4457: Computer Networks
CS 4501: Special Topics in Computer Science
CS 4753: Electronic Commerce Technologies
CS 4810: Introduction to Computer Graphics
CS 4993: Independent Study
CS 4998: Distinguished BA Majors Research
CS 6161: Design and Analysis of Algorithms
CS 6190: Computer Science Perspectives
CS 6354: Computer Architecture
CS 6444: Introduction to Parallel Computing
CS 6501: Special Topics in Computer Science
CS 6610: Programming Languages
CS 7457: Computer Networks
CS 7993: Independent Study
CS 7995: Supervised Project Research
CS 8501: Special Topics in Computer Science
CS 8524: Topics in Software Engineering
CS 8897: Graduate Teaching Instruction
CS 8999: Thesis
CS 9999: Dissertation
Too flippin' cool! Vlad is right though; use the connect(...) method. 1+ to Vlad.
Other suggestions and hints:
These are the constants that I used in my little program:
private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
"page.php?Semester=1118&Type=Group&Group=CompSci";
private static final String TD_TAG = "td";
private static final String CLASS_ATTRIB = "class";
private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";
And these are the variables I used inside the scraping method:
String unitName = "";
List<String> courseNumbNameList = new ArrayList<String>();
String courseNumbName = "";
Edit 1
Based on your recent comments, I think that you're over-thinking things a bit. What worked well for me is this simple algorithm:
Get each Element with td.get(i);