How to avoid surrounding html head tags in Jsoup parse

Question

I am using Jsoup to parse the HTML content below. After Jsoup.parse(), the output has html, head and body tags added around the input. I want the output without these wrapper tags.

Sample Input:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Java code:

import java.io.File;
import java.nio.charset.Charset;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParse {

    public static void main(String[] args) {
        try {
            File input = new File("/ab.html");
            // Read with the platform default charset, which is what passing null meant.
            String html = FileUtils.readFileToString(input, Charset.defaultCharset());

            Document doc = Jsoup.parseBodyFragment(html);
            doc.outputSettings().prettyPrint(false);
            System.out.println(doc.html());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Actual output:

<html><head></head><body><p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
    </body></html>

Expected Output:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Please help.

Answer 1

The cause:

parseBodyFragment(), like all the other parse() methods, uses an HTML parser by default, and the HTML parser always adds the HTML shell (<html>…</html>, <head>…</head>, etc.).

The Solution:

Just don't use an HTML parser; use an XML parser instead ;-)

Document doc = Jsoup.parse(html, "", Parser.xmlParser());

Replace that single line (and add an import for org.jsoup.parser.Parser), and your problem is solved.

Example:

final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";

Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("*******  XML *******\n" + docXml);

Output:

******* HTML *******
<html>
 <head></head>
 <body>
  <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
 </body>
</html>

*******  XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Answer 2

To get the expected output, parse the fragment as before and print only the contents of the body element:

final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);

System.out.println(doc.body().html());
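
If the same conversion is needed in more than one place, the lines above can be wrapped in a small helper. This is only a minimal sketch: the class name FragmentCleaner and the method name bodyHtml are made up for illustration, and it assumes jsoup is on the classpath. Since body().html() serializes everything inside the body, it should also work for fragments with several top-level elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper; the class and method names are illustrative, not part of jsoup.
class FragmentCleaner {

    // Parses a fragment with the HTML parser and returns only what ends up
    // inside <body>, without the generated <html>/<head>/<body> wrapper.
    static String bodyHtml(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        // Turn off pretty-printing so jsoup does not re-indent the markup.
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The sample input from the question.
        System.out.println(bodyHtml("<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>"));
        // A fragment with two top-level elements.
        System.out.println(bodyHtml("<p>one</p><p>two</p>"));
    }
}
```

The helper keeps the HTML parser, so the wrapper tags are still created internally; they are just never printed.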

Answer 3

You can try using the XML parser, but this doesn't always work because HTML is not always valid XML; it often has unterminated tags like <img> and <br>. It's better to stick with the HTML parser. You can rely on the <html>, <head>, and <body> tags being there, and they are easy to discard. Just get your fragment of HTML by selecting the body tag and asking for its HTML.

Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.select("body").html());
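
To see the difference for yourself, something like the following sketch can be used to run both parsers on a fragment with an unterminated <br>. The class and method names (ParserComparison, viaHtmlParser, viaXmlParser) are hypothetical and jsoup is assumed to be on the classpath. No expected output is given for the XML parser because how it handles the unclosed tag is exactly the part worth checking against your own jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical comparison class; the names are illustrative only.
class ParserComparison {

    // HTML parser: parse the fragment, then keep only the contents of <body>.
    static String viaHtmlParser(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.outputSettings().prettyPrint(false);
        return doc.select("body").html();
    }

    // XML parser: no <html>/<head>/<body> shell is added, but the input is
    // treated as XML, so HTML-only conventions such as an unclosed <br> are not known to it.
    static String viaXmlParser(String fragment) {
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        String fragment = "<p>first line<br>second line</p>";
        System.out.println("HTML parser: " + viaHtmlParser(fragment));
        System.out.println("XML parser:  " + viaXmlParser(fragment));
    }
}
```

For well-formed input such as the sample in the question, both routes should give the same string; they only diverge once the markup stops being valid XML.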

Answer 4

You can also keep using Jsoup.parse with the HTML parser. All you need to do is strip away the html and body wrappers.

This can be done by selecting the body element and unwrapping it:

String input = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
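Node content = Jsoup.parse(input).body().unwrap();
System.out.println(content.outerHtml());

body() selects the body element, and unwrap() removes it while keeping its children in the document. unwrap() returns the first of those children as a Node, so the snippet prints that node with outerHtml() (Node has no no-argument html() method). If the fragment has several top-level nodes, only the first one is printed.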

So output is:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
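
One thing to keep in mind with unwrap() is that it returns only the first child of the element it removes. If the input can contain several top-level nodes and you want all of them as separate pieces, a sketch along these lines may help. FragmentNodes and topLevelHtml are made-up names for illustration, and jsoup is assumed to be on the classpath. The body is left in place here, and its child nodes are walked one by one.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// Hypothetical helper; the class and method names are illustrative only.
class FragmentNodes {

    // Returns the serialized HTML of each top-level node found inside <body>.
    static List<String> topLevelHtml(String fragment) {
        Document doc = Jsoup.parse(fragment);
        // Turn off pretty-printing so each node is serialized without re-indentation.
        doc.outputSettings().prettyPrint(false);
        List<String> result = new ArrayList<>();
        for (Node child : doc.body().childNodes()) {
            // outerHtml() includes the node's own tag, unlike Element.html().
            result.add(child.outerHtml());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String piece : topLevelHtml("<p>one</p><p>two</p>")) {
            System.out.println(piece);
        }
    }
}
```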

Answer details (4 answers)

Answer 1: 22, ACCEPTED, 2014-10-03 15:24:49
Answer 2: 10, 2015-09-18 06:00:24
Answer 3: 9, 2017-08-30 20:45:49
Answer 4: 3, 2020-11-24 12:59:20
