Parsing an HTML table without an ID with Jsoup in Java
I am trying to process a large amount of data for a research project. I have a number of HTML files on my computer and I need to read some information from them into a Java program.
I use Jsoup to load the document.
Unfortunately the table in the HTML has no class or id (and there are multiple tables). I have searched Stack Overflow, but all the answers I find use table.class.
How could I get the data (18/01/2014) from the table below? doc.select is not working, I think because of the missing class.
I am trying something like this:
Element table = doc.select("table").first();
Iterator<Element> ite = table.select("td").iterator();
ite.next();
System.out.println("Value 1: " + ite.next().text());
System.out.println("Value 2: " + ite.next().text());
System.out.println("Value 3: " + ite.next().text());
System.out.println("Value 4: " + ite.next().text());
<table border=0 cellpadding=0 cellspacing=0 width=650 height=18><tr><td class="header" style="color:#FFFFFF;"><table border=0 cellpadding=0 cellspacing=0><tr>
<td><img src="/images/title_ultratop.png"></td><td style="color:#FFFFFF;vertical-align:middle;"><b>50 DANCE<br>
<a href="link"><img src="/images/arr_bw.png" border=0 style="margin-bottom:1px;margin-right:3px;"></a>18/01/2014
</b></td></tr></table>
-- EDIT --
I found that the table was inside another table. Using the code below I can reach it, BUT I only get one line now. That gives me just the table; I still need to get one element out of it.
Element table = doc.select("table table").first();
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.get(0).text());
}
I guess I am displaying the entire table now. How do I get, let's say, the 2nd element?
There are some problems in your HTML. I suppose the correct version is:
<table border="1" cellpadding="0" cellspacing="0" width="650" height="18">
<tr>
<td class="header" style="color:#FFFFFF;">
<table border="1" cellpadding="0" cellspacing="0">
<tr>
<td><img src="/images/title_ultratop.png"></td>
<td style="color:#FFFFFF;vertical-align:middle;">
<b>50 DANCE
<br>
<a href="link"><img src="/images/arr_bw.png" border="0"
style="margin-bottom:1px;margin-right:3px;"></a>
18/01/2014
</b>
</td>
</tr>
</table>
</td>
</tr>
</table>
In order to get that node, select table table td b and then take the child node at index 4, which is a text node (child indices are zero-based, so this is the fifth child of the b element):
Elements td = doc.select("table table td b");
TextNode el = (TextNode)td.first().childNode(4);
System.out.println(el.text());
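The hard-coded child index above breaks if the markup gains or loses a node (for example, a whitespace text node). A minimal sketch of a less position-dependent variant follows, assuming the date is always the last non-blank text directly inside the b element and that Jsoup is on the classpath. The class name NestedTableDate and the method extractDate are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class NestedTableDate {

    // Hypothetical helper: returns the last non-blank text node that is a
    // direct child of the first <b> inside a nested table, or null if
    // nothing matches.
    public static String extractDate(String html) {
        Document doc = Jsoup.parse(html);
        Element bold = doc.select("table table td b").first();
        if (bold == null) {
            return null;
        }
        String result = null;
        // textNodes() lists only the direct text children of <b>,
        // so the text of the nested <a> is not included.
        for (TextNode node : bold.textNodes()) {
            String text = node.text().trim();
            if (!text.isEmpty()) {
                result = text;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Reduced version of the markup from the question.
        String html = "<table><tr><td class=\"header\"><table><tr>"
                + "<td><img src=\"/images/title_ultratop.png\"></td>"
                + "<td><b>50 DANCE<br>\n"
                + "<a href=\"link\"><img src=\"/images/arr_bw.png\"></a>18/01/2014\n"
                + "</b></td></tr></table></td></tr></table>";
        System.out.println(extractDate(html));
    }
}
```

The trim() call matters here because the text node carries the trailing line break from the source markup.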
Right, a third embedded table, and it works.
Element table = doc.select("table table").first();
I still need to select a different table on the site as well. I read about table:contains(word). Hope that will work!
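For what it's worth, here is a rough sketch of how the :contains selector could be used to pick a table by its text when there is no class or id. One thing to watch: an outer table also "contains" any text found in tables nested inside it, so the sketch takes the last match in document order, which for a single chain of nested tables is the innermost one. The class name TableByText and the method findTable are hypothetical, and the sketch assumes the search text has no parentheses that would break the selector syntax.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableByText {

    // Hypothetical helper: returns the last table (in document order) whose
    // text contains the given string, or null if no table matches.
    // :contains is case-insensitive.
    public static Element findTable(Document doc, String text) {
        return doc.select("table:contains(" + text + ")").last();
    }

    public static void main(String[] args) {
        // Made-up markup: an outer table wrapping the table of interest,
        // followed by an unrelated table.
        String html = "<table><tr><td>Charts</td><td>"
                + "<table><tr><td><b>50 DANCE<br>18/01/2014</b></td></tr></table>"
                + "</td></tr></table>"
                + "<table><tr><td>Albums</td></tr></table>";
        Document doc = Jsoup.parse(html);

        Element table = findTable(doc, "50 DANCE");
        if (table != null) {
            // Prints only the inner table's text, without "Charts".
            System.out.println(table.text());
        }
    }
}
```

If several unrelated tables contain the same text, the last one wins, so a more specific search string is the safer choice.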