How to extract text from specific rows in nested tables with Jsoup
I'm using Jsoup to extract text from a website, and I can't figure out how to get specific rows of data out of nested tables. I need the plain text that follows the Property Address: and Mailing Address: labels, so I can store the data.
Here is the HTML source I am parsing:
<table width="730" border="0" cellspacing="0" cellpadding="2">
<tr>
<td><table width="730" border="0" cellspacing="0" cellpadding="2">
<tr>
<td><h1>Property Information</h1>
<table width="758">
<tr>[IRRELEVANT]</tr>
<tr>[IRRELEVANT]</tr>
<tr>
<td colspan="3"><strong>Property Address:</strong> !!THIS PLAIN TEXT HERE IS WHAT I NEED!! DATA1</td>
<td> </td>
</tr>
<tr>
<td colspan="3"><strong>Mailing Address:</strong>!!NEED THIS TOO!! DATA2</td>
<td> </td>
</tr>
<tr>[IRRELEVANT]</tr>...................
I was using this as a template, but it doesn't work, and I have no idea how to make it work.
Document documentSerialNumberPageData = Jsoup.connect(stringURLOfSerialNumberPage).get(); //connect to serial number page
Elements elementsSerialNumberPageData = documentSerialNumberPageData.select("#tabletext tbody > tr > td > tbody > tr > td > tbody > tr > td"); //this is not even remotely correct... :(
Element elementAddress = elementsSerialNumberPageData.get(0);
System.out.println(elementAddress.text());
My knowledge of HTML/CSS is very limited, but I'm proficient in Java. Any suggestions? Thanks!
Full Source Here: https://github.com/PhotonPhighter/NODScraper/blob/master/src/nodscraper/Main.java
You can try this:
Elements innerTable = documentSerialNumberPageData.select("body > table:nth-child(2) > tbody > tr > td > table > tbody > tr > td > table:nth-child(2)");
String propertyAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(3) > td > strong").first().nextSibling()).text();
String mailingAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(4) > td > strong").first().nextSibling()).text();
First, you select the inner table. Then you select the strong tag in the first td of the third tr, take its next sibling (the text node that follows the label), and call text() on it. We do the same for the fourth tr.
With text(), Jsoup will convert &nbsp; entities into spaces; if you would rather keep them, you can call toString() instead.
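The positional selectors above depend on the exact page layout, so it may help to see the same next-sibling idea driven by the label text instead. The following is only a sketch under assumptions: SAMPLE_HTML is a cut-down stand-in for the page in the question (with the irrelevant rows replaced by dummy cells), and the class name LabelledTextExample and the helper textAfterLabel are hypothetical names, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class LabelledTextExample {

    // Assumption: a simplified copy of the nested tables from the question.
    // DATA1 and DATA2 are the question's own placeholders.
    static final String SAMPLE_HTML =
        "<table><tr><td><table><tr><td><h1>Property Information</h1>"
        + "<table>"
        + "<tr><td>irrelevant</td></tr>"
        + "<tr><td>irrelevant</td></tr>"
        + "<tr><td colspan=\"3\"><strong>Property Address:</strong> DATA1</td><td> </td></tr>"
        + "<tr><td colspan=\"3\"><strong>Mailing Address:</strong> DATA2</td><td> </td></tr>"
        + "</table></td></tr></table></td></tr></table>";

    /**
     * Hypothetical helper: finds the first <strong> whose own text equals the
     * label and returns the text node that directly follows it, trimmed.
     * Returns null when the label is missing or is not followed by text.
     */
    static String textAfterLabel(Document doc, String label) {
        for (Element strong : doc.select("strong")) {
            if (!strong.ownText().trim().equals(label)) {
                continue;
            }
            Node next = strong.nextSibling();
            if (next instanceof TextNode) {
                return ((TextNode) next).text().trim();
            }
            return null;
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(textAfterLabel(doc, "Property Address:")); // prints DATA1
        System.out.println(textAfterLabel(doc, "Mailing Address:"));  // prints DATA2
    }
}
```

Matching on the label means the lookup should keep working if rows are added or reordered, at the cost of depending on the exact label wording. The instanceof check also avoids the ClassCastException that a bare cast would throw if the label were followed by another element rather than plain text.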
Hope that it helps.
PS: Can I suggest a trick? You can use the developer tools of Chrome or Firefox to find a tag in an HTML page, then right-click it and choose Copy CSS Path. This will give you a selector you can use in Jsoup!
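Before relying on a copied selector, it can be worth checking how many elements it actually matches in the document Jsoup parsed. The sketch below is an illustration under assumptions: SAMPLE_HTML is a simplified stand-in for the real page, the candidate selector was written by hand for that stand-in rather than copied from a browser, and SelectorCheckExample and countMatches are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorCheckExample {

    // Assumption: a simplified copy of the nested tables from the question.
    static final String SAMPLE_HTML =
        "<table><tr><td><table><tr><td><h1>Property Information</h1>"
        + "<table>"
        + "<tr><td>irrelevant</td></tr>"
        + "<tr><td>irrelevant</td></tr>"
        + "<tr><td colspan=\"3\"><strong>Property Address:</strong> DATA1</td><td> </td></tr>"
        + "<tr><td colspan=\"3\"><strong>Mailing Address:</strong> DATA2</td><td> </td></tr>"
        + "</table></td></tr></table></td></tr></table>";

    /** Hypothetical helper: number of elements the CSS query matches in the document. */
    static int countMatches(Document doc, String cssQuery) {
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);

        // The selector from the question: nothing in the sample has id="tabletext".
        String fromQuestion = "#tabletext tbody > tr > td > tbody > tr > td > tbody > tr > td";
        // A hand-written candidate for the sample: the label in the third row of the innermost table.
        String candidate = "table table table tr:nth-child(3) > td > strong";

        System.out.println(countMatches(doc, fromQuestion)); // prints 0
        System.out.println(countMatches(doc, candidate));    // prints 1
    }
}
```

A count of 0 explains why calling get(0) or first() on the result fails, and a count above 1 is a hint that the selector is too loose for the element you are after.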