How to locate a div tag placed directly inside a tr tag using jsoup

Here is the scenario:

<html>
    <body>
        <div class="specalClass">
            <table>
            <tbody id="mainTable">
                <tr><td>data 1</td></tr>
                <tr><div>Data</div></tr>
            </tbody>
            </table>
        </div>
    </body>
</html>

The problem is how to get the div that is placed directly inside the tr tag. Every other element is reachable except this div. This is only sample code: we cannot use XPath or select the div tag directly, because the real page is large. We can get the table by its id and then need to iterate over it.

You can use a CSS selector to get a div that is a direct child of a tr in the table with id mainTable:

    doc.select("#mainTable tr>div");

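but it won't work, because there is another problem here.
A div is not allowed directly inside a tr, so Jsoup's HTML parser, which follows the HTML standard, removes it from the tr (in the output further down it ends up in front of the table, and the tr is left empty). To skip this correction, parse the document with the XML parser: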

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

The XML parser keeps the original structure, so the div is now reachable.
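
Since the question mentions getting the table by its id and then iterating over it, here is a minimal sketch of that loop on top of the XML parser. The class name `RowDivFinder` and the helper `directDivTexts` are made up for illustration; only the jsoup calls themselves (`Jsoup.parse`, `Parser.xmlParser`, `getElementById`, `children`, `tagName`, `text`) are real API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class RowDivFinder {

    // Hypothetical helper: collects the text of every div that is a
    // direct child of a tr inside the element with the given id.
    static List<String> directDivTexts(String html, String tableBodyId) {
        // The XML parser does not apply HTML nesting rules,
        // so a div inside a tr stays where it was written.
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        List<String> texts = new ArrayList<>();

        Element body = doc.getElementById(tableBodyId);
        if (body == null) {
            return texts; // id not found, nothing to iterate
        }
        for (Element row : body.children()) {
            if (!row.tagName().equals("tr")) {
                continue; // skip anything that is not a row
            }
            for (Element child : row.children()) {
                if (child.tagName().equals("div")) {
                    texts.add(child.text());
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table><tbody id=\"mainTable\">"
                + "<tr><td>data 1</td></tr>"
                + "<tr><div>Data</div></tr>"
                + "</tbody></table>";
        System.out.println(directDivTexts(html, "mainTable"));
    }
}
```

One thing to keep in mind: the XML parser preserves tag-name case, so on a page that writes `<TR>` or `<DIV>` the `equals` checks above would need to become case-insensitive.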

EDIT:
To counter the claims that this answer doesn't work, here is the whole code with the output of both the HTML and XML parsers.
The first example doesn't work, but the second one, which follows my answer, works fine:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class Stackoverflow62376512 {

    public static void main(final String[] args) {
        String html = "<html>\n" + 
                "    <body>\n" + 
                "        <div class=\"specalClass\">\n" + 
                "            <table>\n" + 
                "            <tbody id=\"mainTable\">\n" + 
                "                <tr><td>data 1</td></tr>\n" + 
                "                <tr><div>Data</div></tr>\n" + 
                "            </tbody>\n" + 
                "            </table>\n" + 
                "        </div>\n" + 
                "    </body>\n" + 
                "</html>";

        Document doc = Jsoup.parse(html, "", Parser.htmlParser()); // same as Jsoup.parse(html);
        System.out.println("Document parsed with HTML parser (div inside tr will be dropped): " + doc);
        System.out.println("Selecting div (this will fail and show null): " + doc.select("#mainTable tr>div").first());

        System.out.println("\n-------------\n");

        doc = Jsoup.parse(html, "", Parser.xmlParser());
        System.out.println("Document parsed with XML parser (div inside tr will be kept): " + doc);
        System.out.println("Selecting div (this one will succeed): " + doc.select("#mainTable tr>div").first());

    }
}

and the output is:

Document parsed with HTML parser (div inside tr will be dropped): <html>
 <head></head>
 <body> 
  <div class="specalClass"> 
   <div>
    Data
   </div>
   <table> 
    <tbody id="mainTable"> 
     <tr>
      <td>data 1</td>
     </tr> 
     <tr></tr> 
    </tbody> 
   </table> 
  </div>  
 </body>
</html>
Selecting div (this will fail and show null): null

-------------

Document parsed with XML parser (div inside tr will be kept): <html> 
 <body> 
  <div class="specalClass"> 
   <table> 
    <tbody id="mainTable"> 
     <tr>
      <td>data 1</td>
     </tr> 
     <tr>
      <div>
       Data
      </div>
     </tr> 
    </tbody> 
   </table> 
  </div> 
 </body> 
</html>
Selecting div (this one will succeed): <div>
 Data
</div>
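
The HTML-parser output above shows that the div is not lost: it is placed immediately before the table. If switching to the XML parser is not an option, that observation suggests a fallback, sketched below under the assumption that the relocated div is the element directly preceding the table. The class name `FosteredDivFinder` and the helper `divBeforeTable` are hypothetical. Note that this cannot tell which row the div originally belonged to.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FosteredDivFinder {

    // Hypothetical helper: with the default HTML parser, looks for a div
    // sitting directly in front of the table that owns the given tbody id.
    // Returns null when there is no such div.
    static Element divBeforeTable(String html, String tableBodyId) {
        Document doc = Jsoup.parse(html); // default HTML parser
        Element body = doc.getElementById(tableBodyId);
        if (body == null || body.parent() == null) {
            return null;
        }
        Element table = body.parent(); // tbody -> table
        Element previous = table.previousElementSibling();
        if (previous != null && previous.tagName().equals("div")) {
            return previous;
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<div><table><tbody id=\"mainTable\">"
                + "<tr><td>data 1</td></tr>"
                + "<tr><div>Data</div></tr>"
                + "</tbody></table></div>";
        Element div = divBeforeTable(html, "mainTable");
        System.out.println(div == null ? "not found" : div.text());
    }
}
```

This is weaker than the XML-parser approach: any unrelated div that legitimately precedes the table would match as well, so it is only worth considering when the page cannot be parsed as XML.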
