Help scraping HTML with JSoup

Question

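I'm a bit of a beginner here, working on a personal project that turns my school's course offerings into an easy-to-read table, but I'm having trouble with the very first step of scraping the data from the website.

I've just added the JSoup library to my project in Eclipse, and although I'm following Jsoup's documentation, I can't get the connection initialized.

Ultimately my goal is to grab each class's name, time, and description, but for now I just want to grab the name. The source site's HTML looks like this: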

<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')

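My first guess is to call getElementsByTag("td") and then read the value of either the onclick= attribute or the class attribute of these elements, cleaning it up by stripping the leading "I" and the trailing "SW" to get the name "CS3330".

Now on to the actual implementation: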
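The cleanup described above could be sketched roughly as follows. This is plain string handling with no JSoup calls; the class name CourseCodeCleaner and both method names are hypothetical, and the attribute formats are assumed from the single HTML fragment shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: CourseCodeCleaner and its methods are hypothetical helper names.
class CourseCodeCleaner {
    // Matches toggledetails('CS3330' and captures the code between the quotes.
    private static final Pattern ONCLICK_CODE = Pattern.compile("toggledetails\\('([^']+)'");

    /** Turns a class value such as "ICS3330 SW" into "CS3330". */
    static String fromClassAttribute(String classValue) {
        // Keep only the first whitespace-separated token, e.g. "ICS3330".
        String first = classValue.trim().split("\\s+")[0];
        // Drop the leading "I" if it is present.
        return first.startsWith("I") ? first.substring(1) : first;
    }

    /** Pulls the course code out of an onclick value; returns "" if none is found. */
    static String fromOnclick(String onclickValue) {
        Matcher m = ONCLICK_CODE.matcher(onclickValue);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(fromClassAttribute("ICS3330 SW"));
        System.out.println(fromOnclick("toggledetails('CS3330')"));
    }
}
```

Reading the onclick value avoids guessing which prefix and suffix to strip, at the cost of depending on the toggledetails function name.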

Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");

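At this point I'm already running into problems (even though I haven't strayed far from the examples in the documentation), and I'd appreciate some guidance on getting my code to run!

EDIT: GOT IT! Thanks, everyone!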

Answer 1

According to the documentation, you should do:

Document doc = Jsoup.connect(url).get();

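The parse() method is for files and for HTML you already hold as a String. Jsoup.parse(String, String) treats its first argument as HTML text and its second as a base URI, so it never fetches the page.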

Answer 2

I just downloaded JSoup and tried it on your school's website, and got this output:

Unit: Computer Science
   CS 1010: Introduction to Information Technology
   CS 1110: Introduction to Programming
   CS 1111: Introduction to Programming
   CS 1112: Introduction to Programming
   CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
   CS 2102: Discrete Mathematics I
   CS 2110: Software Development Methods
   CS 2150: Program and Data Representation
   CS 2220: Engineering Software
   CS 2330: Digital Logic Design
   CS 2501: Special Topics in Computer Science
   CS 3102: Theory of Computation
   CS 3330: Computer Architecture
   CS 4102: Algorithms
   CS 4240: Principles of Software Design
   CS 4414: Operating Systems
   CS 4444: Introduction to Parallel Computing
   CS 4457: Computer Networks
   CS 4501: Special Topics in Computer Science
   CS 4753: Electronic Commerce Technologies
   CS 4810: Introduction to Computer Graphics
   CS 4993: Independent Study
   CS 4998: Distinguished BA Majors Research
   CS 6161: Design and Analysis of Algorithms
   CS 6190: Computer Science Perspectives
   CS 6354: Computer Architecture
   CS 6444: Introduction to Parallel Computing
   CS 6501: Special Topics in Computer Science
   CS 6610: Programming Languages
   CS 7457: Computer Networks
   CS 7993: Independent Study
   CS 7995: Supervised Project Research
   CS 8501: Special Topics in Computer Science
   CS 8524: Topics in Software Engineering
   CS 8897: Graduate Teaching Instruction
   CS 8999: Thesis
   CS 9999: Dissertation

Very cool! Vlad is right; use the connect(...) method. +1 to Vlad.

Other suggestions and tips:
These are the constants I used in my little program:

   private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
        "page.php?Semester=1118&Type=Group&Group=CompSci";
   private static final String TD_TAG = "td";
   private static final String CLASS_ATTRIB = "class";
   private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
   private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
   private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

These are the variables I used in my scraping method:

     String unitName = "";
     List<String> courseNumbNameList = new ArrayList<String>();
     String courseNumbName = "";

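Edit 1
Based on your recent comments, I think you're overthinking things a bit. What worked well for me was this simple algorithm:

Create the 3 variables I listed above.
Get the Document the way Vlad suggested.
Create a td Elements variable and assign to it all the elements that have the td tag.
Use a for loop with an int running from 0 to < td.size(), and get each Element with td.get(i).
Inside the loop, check the element's class attribute.
If the attribute String equals the CLASS_ATTRIB_UNIT_NAME String (see above), get the element's text and use it to set the unitName variable.
If the attribute String equals CLASS_ATTRIB_COURSE_NUM, set courseNumbName to the element's text.
If the attribute String equals CLASS_ATTRIB_COURSE_NAME, append the element's text to the courseNumbName String, add that String to the array list, and then reset courseNumbName to "".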
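The loop above might look something like the sketch below. To keep it runnable without the JSoup jar, each td is modelled by a hypothetical Cell class holding the class attribute and the text; with JSoup those two values would come from element.attr("class") and element.text(). The names CourseListSketch, Cell, and Result, the ": " separator, and the sample cell contents are assumptions for illustration, not the code I actually ran.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: Cell stands in for a jsoup td Element (assumption, see lead-in).
class CourseListSketch {
    static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
    static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
    static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

    static final class Cell {
        final String classAttr;
        final String text;
        Cell(String classAttr, String text) {
            this.classAttr = classAttr;
            this.text = text;
        }
    }

    static final class Result {
        final String unitName;
        final List<String> courses;
        Result(String unitName, List<String> courses) {
            this.unitName = unitName;
            this.courses = courses;
        }
    }

    static Result collect(List<Cell> td) {
        String unitName = "";
        List<String> courseNumbNameList = new ArrayList<String>();
        String courseNumbName = "";
        for (int i = 0; i < td.size(); i++) {
            Cell cell = td.get(i);
            if (CLASS_ATTRIB_UNIT_NAME.equals(cell.classAttr)) {
                unitName = cell.text;
            } else if (CLASS_ATTRIB_COURSE_NUM.equals(cell.classAttr)) {
                courseNumbName = cell.text;
            } else if (CLASS_ATTRIB_COURSE_NAME.equals(cell.classAttr)) {
                // The ": " separator is assumed from the printed output.
                courseNumbName += ": " + cell.text;
                courseNumbNameList.add(courseNumbName);
                courseNumbName = "";
            }
            // Cells with any other class are ignored.
        }
        return new Result(unitName, courseNumbNameList);
    }

    public static void main(String[] args) {
        List<Cell> td = Arrays.asList(
                new Cell("UnitName", "Computer Science"),
                new Cell("CourseNum", "CS 1010"),
                new Cell("CourseName", "Introduction to Information Technology"));
        Result r = collect(td);
        System.out.println("Unit: " + r.unitName);
        for (String course : r.courses) {
            System.out.println("   " + course);
        }
    }
}
```

Because the course number and course name arrive in separate cells, courseNumbName acts as a small buffer that is flushed into the list each time a name cell completes the pair.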

Answer 1
3 2011-08-07 19:19:16

Answer 2
1 accepted 2011-08-07 20:02:39