How to locate a div tag placed directly inside a tr tag using jsoup

Here is the scenario:

<html>
    <body>
        <div class="specalClass">
            <table>
            <tbody id="mainTable">
                <tr><td>data 1</td></tr>
                <tr><div>Data</div></tr>
            </tbody>
            </table>
        </div>
    </body>
</html>

The problem: how do I get the div that is placed directly inside the tr tag? Every other element is reachable, except this div. This is only sample code. We cannot use XPath or select the div tag directly, because the real page is large. We can get the table by its id and then need to iterate over it.

You can use a CSS selector to get a div that is a direct child of a tr inside the element with id mainTable:

    doc.select("#mainTable tr>div");

but it won't work on its own, because there is another problem here.
A div is not allowed directly inside a tr, so jsoup's HTML parser, which follows the HTML standard, moves it out of the table (it ends up just before the table element), and the selector no longer matches anything. To keep the markup exactly as written, parse the document with the XML parser instead:

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

This keeps the original structure, so the div is now reachable.
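
Since the question mentions fetching the table by its id and then iterating over it, here is a minimal sketch of that approach on top of the XML parser. The class name RowDivFinder and the method directDivTexts are made up for illustration and are not part of jsoup; the sketch assumes the id sits on the element whose children are the rows, as in the sample markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of jsoup): walks the rows under the element
// with the given id and collects the text of every div that is a direct
// child of a tr.
public class RowDivFinder {

    static List<String> directDivTexts(String html, String tableId) {
        // The XML parser keeps the markup as written, so the div stays inside the tr.
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        List<String> texts = new ArrayList<>();

        Element table = doc.getElementById(tableId);
        if (table == null) {
            return texts; // no element with that id
        }
        for (Element row : table.children()) {
            if (!row.tagName().equalsIgnoreCase("tr")) {
                continue; // skip anything that is not a row
            }
            for (Element child : row.children()) {
                if (child.tagName().equalsIgnoreCase("div")) {
                    texts.add(child.text());
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table><tbody id=\"mainTable\">"
                + "<tr><td>data 1</td></tr>"
                + "<tr><div>Data</div></tr>"
                + "</tbody></table>";
        System.out.println(directDivTexts(html, "mainTable"));
    }
}
```

Iterating by hand like this is only worthwhile if each row needs extra processing; for a plain lookup the one-line selector is shorter.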

EDIT:
To counter the accusation that this answer does not work, here is the complete code together with the output from both the HTML and the XML parser.
The first example doesn't work, but the second one, which follows my answer, works fine:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class Stackoverflow62376512 {

    public static void main(final String[] args) {
        String html = "<html>\n" + 
                "    <body>\n" + 
                "        <div class=\"specalClass\">\n" + 
                "            <table>\n" + 
                "            <tbody id=\"mainTable\">\n" + 
                "                <tr><td>data 1</td></tr>\n" + 
                "                <tr><div>Data</div></tr>\n" + 
                "            </tbody>\n" + 
                "            </table>\n" + 
                "        </div>\n" + 
                "    </body>\n" + 
                "</html>";

        Document doc = Jsoup.parse(html, "", Parser.htmlParser()); // same as Jsoup.parse(html);
        System.out.println("Document parsed with HTML parser (div inside tr will be dropped): " + doc);
        System.out.println("Selecting div (this will fail and show null): " + doc.select("#mainTable tr>div").first());

        System.out.println("\n-------------\n");

        doc = Jsoup.parse(html, "", Parser.xmlParser());
        System.out.println("Document parsed with XML parser (div inside tr will be kept): " + doc);
        System.out.println("Selecting div (this one will succeed): " + doc.select("#mainTable tr>div").first());

    }
}

and the output is:

Document parsed with HTML parser (div inside tr will be dropped): <html>
 <head></head>
 <body> 
  <div class="specalClass"> 
   <div>
    Data
   </div>
   <table> 
    <tbody id="mainTable"> 
     <tr>
      <td>data 1</td>
     </tr> 
     <tr></tr> 
    </tbody> 
   </table> 
  </div>  
 </body>
</html>
Selecting div (this will fail and show null): null

-------------

Document parsed with XML parser (div inside tr will be kept): <html> 
 <body> 
  <div class="specalClass"> 
   <table> 
    <tbody id="mainTable"> 
     <tr>
      <td>data 1</td>
     </tr> 
     <tr>
      <div>
       Data
      </div>
     </tr> 
    </tbody> 
   </table> 
  </div> 
 </body> 
</html>
Selecting div (this one will succeed): <div>
 Data
</div>
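
The HTML-parser output above shows that the div is not lost: it ends up as a child of div.specalClass, immediately before the table. If switching to the XML parser is not an option, one possible workaround is to look at the element just before the table. This is only a sketch that relies on the relocation seen in that output, so treat it as an assumption to verify against your jsoup version and the real page. RelocatedDivFinder and divBeforeTable are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper (not part of jsoup): finds a div that the HTML parser
// has moved out of the table and placed directly in front of it.
public class RelocatedDivFinder {

    // Returns the div sitting directly before the table that contains the
    // element with the given id, or null if there is none.
    // Assumption: the id is on the tbody, as in the sample markup.
    static Element divBeforeTable(String html, String tbodyId) {
        Document doc = Jsoup.parse(html); // default HTML parser
        Element tbody = doc.getElementById(tbodyId);
        if (tbody == null) {
            return null;
        }
        Element table = tbody.parent();
        if (table == null) {
            return null;
        }
        // Only the element immediately preceding the table is checked.
        Element previous = table.previousElementSibling();
        if (previous != null && previous.tagName().equals("div")) {
            return previous;
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"specalClass\"><table><tbody id=\"mainTable\">"
                + "<tr><td>data 1</td></tr>"
                + "<tr><div>Data</div></tr>"
                + "</tbody></table></div>";
        System.out.println(divBeforeTable(html, "mainTable"));
    }
}
```

Note that this cannot tell which row the div originally belonged to, so the XML parser remains the more reliable choice when that matters.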
