I'm using Jsoup to extract text from a website, and I can't figure out how to get specific rows of data out of nested tables. I need the plain text that follows the labels "Property Address:" and "Mailing Address:" so I can store it.
Here is the HTML source I am parsing:
<table width="730" border="0" cellspacing="0" cellpadding="2">
<tr>
<td><table width="730" border="0" cellspacing="0" cellpadding="2">
<tr>
<td><h1>Property Information</h1>
<table width="758">
<tr>[IRRELEVANT]</tr>
<tr>[IRRELEVANT]</tr>
<tr>
<td colspan="3"><strong>Property Address:</strong> !!THIS PLAIN TEXT HERE IS WHAT I NEED!! DATA1</td>
<td> </td>
</tr>
<tr>
<td colspan="3"><strong>Mailing Address:</strong>!!NEED THIS TOO!! DATA2</td>
<td> </td>
</tr>
<tr>[IRRELEVANT]</tr>...................
I was using this as a template, but it doesn't work, and I have no idea how to make it work.
Document documentSerialNumberPageData = Jsoup.connect(stringURLOfSerialNumberPage).get(); //connect to serial number page
Elements elementsSerialNumberPageData = documentSerialNumberPageData.select("#tabletext tbody > tr > td > tbody > tr > td > tbody > tr > td"); //this is not even remotely correct... :(
Element elementAddress = elementsSerialNumberPageData.get(0);
System.out.println(elementAddress.text());
My knowledge of HTML/CSS is very limited, but I'm proficient in Java. Any suggestions? Thanks! Full Source Here: https://github.com/PhotonPhighter/NODScraper/blob/master/src/nodscraper/Main.java
You can try this:
Elements innerTable = documentSerialNumberPageData.select("body > table:nth-child(2) > tbody > tr > td > table > tbody > tr > td > table:nth-child(2)");
String propertyAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(3) > td > strong").first().nextSibling()).text();
String mailingAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(4) > td > strong").first().nextSibling()).text();
First, you select the table, then the strong tag in the first td of the third tr. Then you take the next sibling of that tag, which is the text node holding the address, and call text() on it. The same is done for the fourth tr.
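To make those steps concrete, here is a minimal, self-contained sketch that runs them against an inline HTML string instead of the live page. The class name, the helper method, the sample markup and the id="info" attribute are all made up for illustration (the real page has no such id), and the null checks are an addition so that a missing row returns null instead of throwing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class SiblingTextDemo {
    // Hypothetical, trimmed-down stand-in for the inner table in the question.
    static final String SAMPLE_HTML =
        "<table id=\"info\">"
        + "<tr><td>row 1</td></tr>"
        + "<tr><td>row 2</td></tr>"
        + "<tr><td colspan=\"3\"><strong>Property Address:</strong> DATA1</td><td></td></tr>"
        + "<tr><td colspan=\"3\"><strong>Mailing Address:</strong>DATA2</td><td></td></tr>"
        + "</table>";

    // Returns the text node that directly follows the <strong> in the given row
    // (1-based, as in nth-child), or null if the row or the text is missing.
    static String textAfterStrong(String html, int rowIndex) {
        Document doc = Jsoup.parse(html);
        Element table = doc.select("table#info").first();
        if (table == null) {
            return null;
        }
        Element strong = table.select("tr:nth-child(" + rowIndex + ") > td > strong").first();
        if (strong == null) {
            return null;
        }
        Node next = strong.nextSibling();
        return (next instanceof TextNode) ? ((TextNode) next).text().trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfterStrong(SAMPLE_HTML, 3)); // prints DATA1
        System.out.println(textAfterStrong(SAMPLE_HTML, 4)); // prints DATA2
    }
}
```

On the real page you would swap the table#info selector for the longer positional selector shown above.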
With text(), Jsoup will translate the &nbsp; entities into spaces. If you would rather keep them, you can call toString() instead.
Hope that it helps.
PS: Can I suggest a trick? You can use the developer tools in Chrome or Firefox to find a tag in an HTML page, then right-click it and choose Copy CSS Path. This gives you a selector you can use in Jsoup!
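One caveat: a copied path is purely positional, so it can stop matching if the page gains or loses a row. If that worries you, a possible alternative is to look the value up by its label. This is only a sketch under assumptions: the class name, method name and sample markup are hypothetical, and it relies on Jsoup's :containsOwn(text) pseudo-selector and Element.ownText(). The label must not contain parentheses, since it is pasted straight into the selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LabelLookupDemo {
    // Hypothetical sample with the same nesting idea as the page in the question.
    static final String SAMPLE_HTML =
        "<table><tr><td><h1>Property Information</h1><table>"
        + "<tr><td>other row</td></tr>"
        + "<tr><td colspan=\"3\"><strong>Property Address:</strong> DATA1</td><td></td></tr>"
        + "<tr><td colspan=\"3\"><strong>Mailing Address:</strong>DATA2</td><td></td></tr>"
        + "</table></td></tr></table>";

    // Finds the <strong> whose own text contains the label, then returns the
    // text that sits directly inside its parent cell (excluding the <strong>).
    static String valueForLabel(String html, String label) {
        Document doc = Jsoup.parse(html);
        Element strong = doc.select("strong:containsOwn(" + label + ")").first();
        if (strong == null || strong.parent() == null) {
            return null;
        }
        return strong.parent().ownText().trim();
    }

    public static void main(String[] args) {
        System.out.println(valueForLabel(SAMPLE_HTML, "Property Address:")); // prints DATA1
        System.out.println(valueForLabel(SAMPLE_HTML, "Mailing Address:"));  // prints DATA2
    }
}
```

Because the lookup keys on the label text rather than on row positions, it does not care how deeply the table is nested.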