How to extract text from specific rows in nested tables using Jsoup

Question

I'm using Jsoup to extract text from a website, and I can't figure out how to select specific rows of data in nested tables. I need the plain text that follows the Property Address: and Mailing Address: labels so that I can store the data.

Here is the HTML source I am parsing:

<table width="730" border="0" cellspacing="0" cellpadding="2">
  <tr> 
    <td><table width="730" border="0" cellspacing="0" cellpadding="2">
      <tr> 
        <td><h1>Property Information</h1>
          <table width="758">
            <tr>[IRRELEVANT]</tr>
            <tr>[IRRELEVANT]</tr>
            <tr>
              <td colspan="3"><strong>Property Address:</strong>&nbsp;!!THIS PLAIN TEXT HERE IS WHAT I NEED!! DATA1</td>
              <td>&nbsp;</td>
              </tr>
            <tr>
              <td colspan="3"><strong>Mailing Address:</strong>!!NEED THIS TOO!! DATA2</td>
              <td>&nbsp;</td>
              </tr>
            <tr>[IRRELEVANT]</tr>...................

I was using this as a template, but it doesn't work, and I have no idea how to make it work.

Document documentSerialNumberPageData = Jsoup.connect(stringURLOfSerialNumberPage).get();   //connect to serial number page
Elements elementsSerialNumberPageData = documentSerialNumberPageData.select("#tabletext tbody > tr > td > tbody > tr > td > tbody > tr > td");  //this is not even remotely correct... :(
Element elementAddress = elementsSerialNumberPageData.get(0);
System.out.println(elementAddress.text());

My knowledge of HTML/CSS is very limited, but I'm proficient in Java. Any suggestions? Thanks! Full source here: https://github.com/PhotonPhighter/NODScraper/blob/master/src/nodscraper/Main.java

Answer 1

You can try this:

// Walk down to the innermost table, the one that follows the <h1> inside the nested cell.
Elements innerTable = documentSerialNumberPageData.select("body > table:nth-child(2) > tbody > tr > td > table > tbody > tr > td > table:nth-child(2)");
// Each address is the text node that immediately follows the <strong> label in its row.
String propertyAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(3) > td > strong").first().nextSibling()).text();
String mailingAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(4) > td > strong").first().nextSibling()).text();

First you select the table, then the strong tag in the first td of the third tr. The next sibling of that tag is the text node you want, so you call text() on it and you are done. We do the same for the fourth tr.
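The positional selectors above depend on the exact page layout. As a rough alternative sketch (not the approach in the answer's code), the label can be located by its text with Jsoup's :containsOwn pseudo-selector, and the next sibling read from there. The class name AddressExtractor, the helper valueAfterLabel, the trimmed sample markup, and the two address values are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class AddressExtractor {

    // Trimmed-down copy of the question's markup; the address values are hypothetical.
    static final String SAMPLE_HTML =
        "<table><tr><td><h1>Property Information</h1><table>"
        + "<tr><td colspan=\"3\"><strong>Property Address:</strong>&nbsp;123 Main St</td><td>&nbsp;</td></tr>"
        + "<tr><td colspan=\"3\"><strong>Mailing Address:</strong>PO Box 7</td><td>&nbsp;</td></tr>"
        + "</table></td></tr></table>";

    // Returns the text directly following <strong>label</strong>, or null if it is not found.
    static String valueAfterLabel(String html, String label) {
        Document doc = Jsoup.parse(html);
        // Find the <strong> whose own text contains the label (assumes the label has no parentheses).
        Element strong = doc.select("strong:containsOwn(" + label + ")").first();
        if (strong == null) {
            return null;
        }
        Node next = strong.nextSibling();
        if (!(next instanceof TextNode)) {
            return null; // guard instead of an unchecked cast
        }
        // getWholeText() keeps the non-breaking space as U+00A0, so replace it by hand before trimming.
        return ((TextNode) next).getWholeText().replace('\u00a0', ' ').trim();
    }

    public static void main(String[] args) {
        System.out.println(valueAfterLabel(SAMPLE_HTML, "Property Address:"));
        System.out.println(valueAfterLabel(SAMPLE_HTML, "Mailing Address:"));
    }
}
```

Looking the row up by its label means the code keeps working if rows are added or reordered, at the cost of breaking if the label wording changes.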

With text(), Jsoup will translate the &nbsp; into a space. If you would rather keep it, you can call toString() instead.
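To see the difference on your own Jsoup version, a small sketch like the following prints the three views of the same text node. The class name NbspDemo, the helper textAfterStrong, and the sample address are assumptions for illustration. The exact whitespace handling of text() may differ between Jsoup releases, so the sketch prints the results rather than stating them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NbspDemo {

    // Returns the text node that follows the first <strong> in the fragment.
    // Assumes the fragment contains a <strong> directly followed by text.
    static TextNode textAfterStrong(String html) {
        Node next = Jsoup.parse(html).select("strong").first().nextSibling();
        return (TextNode) next;
    }

    public static void main(String[] args) {
        TextNode node = textAfterStrong(
            "<p><strong>Property Address:</strong>&nbsp;123 Main St</p>");
        System.out.println("[" + node.text() + "]");         // whitespace-normalised text
        System.out.println("[" + node.getWholeText() + "]"); // raw text, U+00A0 preserved
        System.out.println("[" + node.toString() + "]");     // HTML form, entity escaped
    }
}
```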

Hope that helps.

PS: Can I suggest a trick? You can use the developer tools in Chrome or Firefox to find a tag in an HTML page, then right-click it and choose Copy CSS Path. This gives you a selector you can use in Jsoup!

Answer 1: 3, accepted, 2014-10-28 20:55:43
