
Parsing an HTML timetable using Jsoup

I understand that there are already quite a few questions about parsing an HTML table. However, after doing some research and looking into Jsoup, I'm still a little stumped by it.

I have a timetable:

[image of the timetable]

I want to parse it to take out the text of the <td> tags, while keeping it in some sort of format.

I have been experimenting with Jsoup, trying out the available functions and reading the Cookbook and the current API documentation. From this I have managed to do the following:

Document doc = Jsoup.connect("http://crwnmis3.staffs.ac.uk/Reporting/Individual;Student%20Sets;name;L2SE?&template=Online%20One%20Page%20Student%20Set&days=1-5&periods=5-53&width=0&height=0").get();

String title = doc.select("td").text();      
System.out.println(title);

The only issue is that this prints out one long string.

I'd much rather have the data split up into manageable chunks. Maybe I could call title.split()? However, that would mean no lecture has a time, unless there is a way of counting the white spaces and deriving the time from that count, assuming each white space is 15 minutes.

I would start by processing one row at a time. So I would begin by getting the quarter-hour cells that follow each weekday label, using a selector like:

tr td.row-label-one:contains(Tue) ~ td

If you loop over the contents of an array like ["Mon", "Tue", ..., "Fri"], you can process the whole week.

This CSS query gives you the td elements that are siblings of that weekday's label cell. Those siblings are the quarter-hour slots of that weekday.
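As a rough sketch of that weekday loop, the snippet below builds the selector for each day and collects the matching sibling cells. The class name WeekdayCells, the helper cellsByDay, and the tiny inline HTML are assumptions made for illustration; the real page's markup may differ, and the only thing taken from the answer is the selector itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WeekdayCells {
    static final List<String> DAYS = List.of("Mon", "Tue", "Wed", "Thu", "Fri");

    // For each weekday, select the <td> cells that follow its row label.
    // A day with no matching row maps to an empty Elements collection.
    static Map<String, Elements> cellsByDay(Document doc) {
        Map<String, Elements> result = new LinkedHashMap<>();
        for (String day : DAYS) {
            String query = "tr td.row-label-one:contains(" + day + ") ~ td";
            result.put(day, doc.select(query));
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real timetable page (assumed markup).
        String html = "<table>"
                + "<tr><td class='row-label-one'>Mon</td><td></td><td>Lecture A</td></tr>"
                + "<tr><td class='row-label-one'>Tue</td><td>Lecture B</td><td></td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        cellsByDay(doc).forEach((day, cells) ->
                System.out.println(day + ": " + cells.size() + " cells"));
    }
}
```

Against the live page you would pass the Document returned by Jsoup.connect(...).get() instead of the inline HTML.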

So just use 9 am as your base and count until you find a non-empty element like "COSE50582/Lec/Sem2 Object-Oriented Application Engineering Gillibrand D, Mansfield GD D116".

You can find this element at index 4, so 9 am + (15 min * 4) = 10 am.
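The arithmetic can be sketched with java.time. The names SlotTimes, DAY_START, SLOT_MINUTES, and startOf are hypothetical, and the 9 am base and 15-minute slot width are the assumptions stated in this answer, not values read from the page.

```java
import java.time.LocalTime;

class SlotTimes {
    // Assumed from the answer: the first slot starts at 9 am and each slot is 15 minutes.
    static final LocalTime DAY_START = LocalTime.of(9, 0);
    static final int SLOT_MINUTES = 15;

    // Start time of the slot at the given zero-based index.
    static LocalTime startOf(int slotIndex) {
        return DAY_START.plusMinutes((long) SLOT_MINUTES * slotIndex);
    }

    public static void main(String[] args) {
        System.out.println(startOf(4)); // prints 10:00
    }
}
```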

Note: For simplicity's sake I'm assuming all subjects last exactly 4 quarter-hours; otherwise you could use the colspan attribute to calculate each subject's duration.
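One way the colspan idea might look is sketched below. To keep it independent of the page, cells are modelled as a hypothetical Cell class holding text and colspan; with Jsoup you would fill these from Element.text() and Element.attr("colspan"), treating an empty attribute as 1. The colspan of 8 in the sample row is made up for illustration.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

class ColspanSchedule {
    // Hypothetical stand-in for a <td>: its text and its colspan (1 when absent).
    static final class Cell {
        final String text;
        final int colspan;
        Cell(String text, int colspan) { this.text = text; this.colspan = colspan; }
    }

    // Walk the row from 9 am, advancing 15 minutes per spanned column,
    // and report the start and end time of every non-empty cell.
    static List<String> describe(List<Cell> row) {
        List<String> out = new ArrayList<>();
        LocalTime cursor = LocalTime.of(9, 0);
        for (Cell cell : row) {
            LocalTime end = cursor.plusMinutes(15L * cell.colspan);
            if (!cell.text.isEmpty()) {
                out.add(cursor + "-" + end + " " + cell.text);
            }
            cursor = end;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> tuesday = List.of(
                new Cell("", 1), new Cell("", 1), new Cell("", 1), new Cell("", 1),
                new Cell("COSE50582/Lec/Sem2", 8)); // illustrative colspan
        describe(tuesday).forEach(System.out::println);
    }
}
```

Because the cursor advances by each cell's own span, empty cells and lectures of different lengths are handled by the same loop.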

You are selecting all elements that match "td" and printing their combined text as one big string. You can instead get them as a collection of elements and iterate over them one by one, like this:

Document doc = Jsoup.connect("http://crwnmis3.staffs.ac.uk/Reporting/Individual;Student%20Sets;name;L2SE?&template=Online%20One%20Page%20Student%20Set&days=1-5&periods=5-53&width=0&height=0").get();
Elements titles = doc.getElementsByTag("td");      
for(Element e : titles) {
    System.out.println(e.text());
}
