Jsoup.clean without adding html entities

Question

I'm stripping unwanted HTML tags (such as <script>) from some text by using

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());

The problem is that it replaces, for instance, å with &aring; (which causes trouble for me since it's not "pure XML").

For example

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())

yields

"hello &aring;  world"

but I would like

"hello å  world"

Is there a simple way to achieve this? (I.e., simpler than converting &aring; back to å in the result.)

Answer 1

You can configure Jsoup's escape mode: using EscapeMode.xhtml gives you output without named entities such as &aring; (only the basic XML escapes, such as &amp; and &lt;, are kept).

Here's a complete snippet that accepts str as input and cleans it using Whitelist.simpleText():

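import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document; the Cleaner returns a new Document
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust the escape mode on the cleaned document
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body
str = doc.body().html();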

Answer 2

There are already feature requests for this on the Jsoup website. You can extend the source code yourself by adding a new empty Map and a new escaping type. If you don't want to do that, you can use StringEscapeUtils from Apache Commons Lang:

public static String getTextOnlyFromHtmlText(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.outputSettings().charset("UTF-8");
    htmlText = Jsoup.clean(doc.body().html(), Whitelist.simpleText());
    // Turn entities such as &aring; back into characters.
    // unescapeHtml is the Commons Lang 2.x name; Commons Lang 3 calls it unescapeHtml4.
    htmlText = StringEscapeUtils.unescapeHtml(htmlText);
    return htmlText;
}
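
One thing to watch with the unescape route: a general HTML unescape also turns &lt; and &gt; back into < and >, so text that was safely escaped can become markup again. Below is a minimal sketch of a narrower unescape that uses only the JDK. The class name, the method name and the tiny entity table are assumptions made for illustration (a real table would cover every named HTML entity); this is not how StringEscapeUtils or Jsoup implement it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SelectiveUnescape {
    // Hypothetical, deliberately tiny table; lt, gt, amp and quot are left out on purpose.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("aring", "\u00E5");  // å
        NAMED.put("auml", "\u00E4");   // ä
        NAMED.put("ouml", "\u00F6");   // ö
        NAMED.put("eacute", "\u00E9"); // é
    }

    // Matches named references such as &aring; (numeric references are not handled here).
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String unescapeNonXml(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = NAMED.get(m.group(1));
            // Names missing from the table (including the XML escapes) are copied through unchanged.
            String text = (replacement != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling SelectiveUnescape.unescapeNonXml on the output of Jsoup.clean would then restore å while leaving &lt;, &gt;, &amp; and &quot; as they are.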

Answer 3

The answer from @bmoc works fine, but you could use a shorter solution:

// Clean the HTML and set the escape mode in a single call
String clean = Jsoup.clean(someInput, "yourBaseUriOrEmpty", Whitelist.simpleText(), new Document.OutputSettings().escapeMode(EscapeMode.xhtml));

Answer 4

A simpler way to do this is:

// clean the html
String output = Jsoup.clean(html, Whitelist.basicWithImages());

// Parse string into a document
Document doc = Jsoup.parse(output);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string
System.out.println(doc.body().html());

I have tested this, and it works.

Answer 5

The accepted answer uses Jsoup.parse, which, after a quick glance at the source, seems more heavyweight than what Jsoup.clean does.

I copied the source code of Jsoup.clean(...) and added a line to set the escape mode. This should avoid some unnecessary work done by the parse method, because it only has to handle a fragment instead of parsing a whole HTML document.

private String clean(String html, Whitelist whitelist) {
    Document dirty = Jsoup.parseBodyFragment(html, "");
    Cleaner cleaner = new Cleaner(whitelist);
    Document clean = cleaner.clean(dirty);
    clean.outputSettings().escapeMode(EscapeMode.xhtml);
    return clean.body().html();
}

Answer 6

Simple way:

EscapeMode em = EscapeMode.xhtml;
em.getMap().clear();

doc.outputSettings().escapeMode(em);

This will remove ALL HTML entities, including these: &apos;, &quot;, &amp;, &lt; and &gt;. EscapeMode.xhtml on its own still allows these entities.
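
If the cleaned text has to end up inside XML (the "pure XML" concern in the question), it may be worth checking what a standard XML parser accepts before dropping those escapes. The sketch below uses only the JDK's built-in parser; the class name, the method name and the wrapping <root> element are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlWellFormedCheck {
    // Returns true if the fragment parses as XML when wrapped in a single root element.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr; fatal errors are still thrown.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (Exception e) {
            // Any parse or configuration failure is reported as "not well-formed".
            return false;
        }
    }
}
```

With this check, a literal å and an escaped &amp; are accepted, while &aring; (an entity XML does not declare) and a bare & are rejected. That is why keeping the few escapes that EscapeMode.xhtml emits is usually the safer choice for XML output.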

Answer 7

Parse the HTML as a Document, then use a Cleaner to clean it, which produces a new Document. Get the outputSettings of that document, set the appropriate charset and set the escape mode to xhtml, then convert the document to a String. Not tested, but should work.

7 answers

solution1
35 ACCEPTED 2012-05-11 12:49:05

solution2
11 2012-02-16 15:08:12

solution3
5 2017-03-24 08:45:49

solution4
2 2013-01-06 06:47:11

solution5
2 2014-02-03 10:17:27

solution6
1 2015-06-26 21:17:03

solution7
0 2011-12-30 19:20:01
