
Jsoup.clean without adding html entities

I'm stripping unwanted HTML tags (such as <script>) from some text by using

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());

The problem is that it replaces, for instance, å with &aring; (which causes trouble for me since it's not "pure XML").

For example

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())

yields

"hello &aring;  world"

but I would like

"hello å  world"

Is there a simple way to achieve this? (I.e., simpler than converting &aring; back to å in the result.)

You can configure Jsoup's escape mode: with EscapeMode.xhtml, the output keeps characters such as å as-is and escapes only the basic XML entities (such as &amp;, &lt; and &gt;).

Here's a complete snippet that takes str as input and cleans it using Whitelist.simpleText():

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body.
str = doc.body().html();

There are already feature requests for this on the Jsoup website. You can extend the source code yourself by adding a new escape mode backed by an empty Map. If you don't want to do that, you can use StringEscapeUtils from Apache Commons Lang to unescape the result:

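public static String getTextOnlyFromHtmlText(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.outputSettings().charset("UTF-8");
    // Strip everything except the simple text-formatting tags
    htmlText = Jsoup.clean(doc.body().html(), Whitelist.simpleText());
    // Turn entities such as &aring; back into characters (Commons Lang 2.x API).
    // Note: this also turns &lt; and &amp; back into < and &.
    htmlText = StringEscapeUtils.unescapeHtml(htmlText);
    return htmlText;
}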

The answer from @bmoc works fine, but you could use a shorter solution:

// Clean html
String clean = Jsoup.clean(someInput, "yourBaseUriOrEmpty", Whitelist.simpleText(), new Document.OutputSettings().escapeMode(EscapeMode.xhtml));

A simpler way to do this is:

// clean the html
String output = Jsoup.clean(html, Whitelist.basicWithImages());

// Parse string into a document
Document doc = Jsoup.parse(output);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string
System.out.println(doc.body().html());

I have tested this and it works.

After a quick glance at the source, the accepted answer's use of Jsoup.parse seems more heavyweight than what Jsoup.clean does internally.

I copied the source code of Jsoup.clean(...) and added a line to set the escape mode. This should avoid some unnecessary steps done by the parse method, because it only has to handle a fragment rather than parse a whole HTML document.

private String clean(String html, Whitelist whitelist) {
    Document dirty = Jsoup.parseBodyFragment(html, "");
    Cleaner cleaner = new Cleaner(whitelist);
    Document clean = cleaner.clean(dirty);
    clean.outputSettings().escapeMode(EscapeMode.xhtml);
    return clean.body().html();
}

Simple way:

EscapeMode em = EscapeMode.xhtml;
em.getMap().clear();

doc.outputSettings().escapeMode(em);

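This will remove ALL HTML entities from the output, including the ones for ', ", &, < and >, which EscapeMode.xhtml would otherwise still escape. Note that EscapeMode.xhtml is a shared enum constant, so clearing its map affects every document that uses this mode, and output with an unescaped < or & may no longer be well-formed.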
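To make the distinction between the basic XML entities and everything else concrete, here is a stand-alone sketch in plain Java. It is not Jsoup's implementation, and the class and method names are made up for illustration. It escapes only &, <, > and " and passes every other character, including å, through unchanged, which is roughly the output shape the xhtml escape mode aims for.

```java
public class XmlOnlyEscaper {

    /**
     * Escapes only the characters that are significant in XML markup.
     * Every other character (for example \u00e5) is copied through as-is.
     * Hypothetical helper for illustration; not part of Jsoup.
     */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e5 is the letter a-with-ring from the question
        System.out.println(escape("hello \u00e5 & <world>"));
    }
}
```

Clearing the map as in the answer above would correspond to dropping all four cases, which is why the result can stop being valid markup.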

Parse the HTML as a Document, then use a Cleaner to clean the document and generate another one. Get the outputSettings of the cleaned document, set the appropriate charset and set the escape mode to xhtml, then convert the document to a String. Not tested, but it should work.
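The steps described above could be sketched roughly as follows. This is an untested-in-your-environment sketch, not a definitive implementation. The class name EntityFreeCleaner and the choice of UTF-8 are assumptions. It is written against a recent Jsoup release, where Whitelist has been renamed to Safelist; on the older versions used elsewhere in this thread, substitute Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

public class EntityFreeCleaner {

    /**
     * Parses the input, cleans it against the given safelist, and serializes
     * the body with the xhtml escape mode so that characters such as
     * \u00e5 are not turned into named HTML entities.
     * Hypothetical helper; the name and the UTF-8 charset are assumptions.
     */
    public static String clean(String html, Safelist safelist) {
        // Parse the HTML as a Document
        Document dirty = Jsoup.parse(html);

        // The Cleaner produces a new Document containing only allowed content
        Document cleaned = new Cleaner(safelist).clean(dirty);

        // Set the charset and escape mode on the cleaned document's output settings
        cleaned.outputSettings()
               .charset("UTF-8")
               .escapeMode(Entities.EscapeMode.xhtml);

        // Convert the body back to a String
        return cleaned.body().html();
    }
}
```

A call might look like EntityFreeCleaner.clean(someInput, Safelist.basicWithImages()), mirroring the whitelist used in the question.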
