
JSoup: check whether <html>, <head> and <body> tags are present

Hi, I am using JSoup to parse an HTML file. After parsing, I want to check whether the file contains the <html> tag. I am using the following code to check that:

htmlDom = parser.parse("<p>My First Heading</p><a href=\"www.google.com\">clk</a>");
Elements pe = htmlDom.select("html");
System.out.println("size  "+pe.size());

The output I get is "size 1" even though there is no <html> tag in the input. My guess is that this is because the <html> tag is not mandatory and is therefore implicit. The same is true of the <head> and <body> tags. Is there any way to check for certain whether these tags are present in the input file?

Thank you.

The select does not return 1 because the tag is implicit, but because an <html> element really is present in the Document object htmlDom once you have parsed the custom HTML.

That is because Jsoup tries to conform to the HTML5 parsing rules, and thus adds missing elements and tries to fix a broken document structure. I'm quite sure you would also get 1 if you were to run the following:

Elements pe = htmlDom.select("head");
System.out.println("size  "+pe.size());
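As a rough illustration of that normalization, the sketch below counts the html, head and body elements after parsing the fragment from the question with jsoup's default HTML parser. The class name NormalizedDomDemo and the helper countAfterHtmlParse are made up for this example; only Jsoup.parse, select and size come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NormalizedDomDemo {

    // Hypothetical helper: counts elements matching tagName after parsing
    // with jsoup's default HTML parser, which adds missing structural elements.
    static int countAfterHtmlParse(String markup, String tagName) {
        Document doc = Jsoup.parse(markup);
        return doc.select(tagName).size();
    }

    public static void main(String[] args) {
        // The fragment from the question: no <html>, <head> or <body> in the input.
        String fragment = "<p>My First Heading</p><a href=\"www.google.com\">clk</a>";
        for (String tag : new String[] {"html", "head", "body"}) {
            System.out.println(tag + " -> " + countAfterHtmlParse(fragment, tag));
        }
    }
}
```

All three counts come back as 1 here, which is why select("html") cannot tell you whether the tag was in the original input.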

To parse the HTML without Jsoup trying to clean it up or make it valid, you can instead use the included XML parser (Parser.xmlParser()), as below, which will parse the HTML as it is.

String customHtml = "<p>My First Heading</p><a href=\"www.google.com\">clk</a>";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());

So, as opposed to your assumption in the comments of the question, this is very much possible to do with Jsoup.
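Putting this together, a presence check could look roughly like the sketch below. It is only one possible way to do it: the class name TagPresenceCheck, the helper hasTag and the sample fullPage string are invented for illustration, and the check assumes the input is regular enough for the XML parser to read sensibly.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class TagPresenceCheck {

    // Hypothetical helper: true if the raw markup contains at least one
    // element with the given tag name. The XML parser is used so that jsoup
    // does not add missing <html>, <head> or <body> elements on its own.
    static boolean hasTag(String markup, String tagName) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        return !doc.select(tagName).isEmpty();
    }

    public static void main(String[] args) {
        String fragment = "<p>My First Heading</p><a href=\"www.google.com\">clk</a>";
        // Made-up sample input that does contain all three tags.
        String fullPage = "<html><head><title>t</title></head><body><p>Hi</p></body></html>";

        for (String tag : new String[] {"html", "head", "body"}) {
            System.out.println(tag + ": fragment=" + hasTag(fragment, tag)
                    + ", fullPage=" + hasTag(fullPage, tag));
        }
    }
}
```

With the fragment from the question, hasTag returns false for all three tags, while the full page returns true for each.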
