Open source java library for HTML to text conversion

Question

Can you recommend an open source Java library (preferably ASL/BSD/LGPL license) that converts HTML to plain text: strips all the tags, converts entities (&amp;, &nbsp;, etc.) and handles <br> and tables properly?

More Info

I have the HTML as a string, so there's no need to fetch it from the web. What I'm looking for is a method like this:

String convertHtmlToPlainText(String html)

Answer 1

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but if you scroll down the homepage a bit there's a link to it.
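For illustration, a minimal sketch of how TextExtractor might be wired into the method the question asks for. It assumes the Jericho HTML Parser jar (package net.htmlparser.jericho, version 3.x) is on the classpath; the class name JerichoPlainText is made up for this example.

```java
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

// Hypothetical wrapper class; only Source and TextExtractor come from Jericho.
public class JerichoPlainText {

    // Parses the HTML held in a string and returns its text content.
    public static String convertHtmlToPlainText(String html) {
        Source source = new Source(html);
        // TextExtractor drops the tags and decodes character entities.
        return new TextExtractor(source).toString();
    }
}
```

TextExtractor collapses the text into a single run, so it suits indexing or searching more than display. Later Jericho releases also have a Renderer (source.getRenderer().toString()) that tries to keep line breaks; check whether the version you use includes it.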

Answer 2

HtmlUnit; it even shows the page as it looks after processing JavaScript / Ajax.

Answer 3

The bliki engine can do this, in two steps. See info.bliki.wiki / Home

How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further
How to convert Mediawiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile( infilepath ); // get content as string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki( new ToWikipedia() );
WikiModel wikiModel = new WikiModel( "${image}", "${title}" );
String plainStr = wikiModel.render( new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );

Jsoup can do this more simply:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but in the result you lose all paragraph formatting -- there will be no newlines at all.
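If the newlines matter, one possible workaround is to walk the parsed tree yourself and emit a line break for <br> and at the end of block-level elements. This is only a sketch: it assumes jsoup is on the classpath, the class name JsoupPlainText is made up, and the list of tags treated as block-level is my own choice, not something jsoup defines.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class; the tag list below is an assumption, adjust as needed.
public class JsoupPlainText {

    public static String convertHtmlToPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the decoded, whitespace-normalized text
                    out.append(((TextNode) node).text());
                } else if (node.nodeName().equals("br")) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                // End each block-level element (and table row) with a newline
                if (name.equals("p") || name.equals("div") || name.equals("li")
                        || name.equals("tr") || name.matches("h[1-6]")) {
                    out.append('\n');
                }
            }
        });
        return out.toString().trim();
    }
}
```

For example, convertHtmlToPlainText("<p>Hello &amp; bye<br>next</p><p>second</p>") gives three lines: "Hello & bye", "next" and "second". Table cells are not separated in this sketch; appending a tab after td/th in tail() would be one way to do that.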

Answer 4

I use TagSoup; it is available for several languages and does a really good job with HTML found "in the wild". It produces a cleaned-up version of the HTML, or XML, which you can then process with a DOM/SAX parser.
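To give an idea of the SAX side, here is a rough sketch of a handler that collects the text and adds a newline for <br>, <p> and <tr>. To keep it self-contained it runs on the JDK's own SAX parser and therefore expects well-formed XHTML as input, which is an assumption of this example. With TagSoup you would presumably use its org.ccil.cowan.tagsoup.Parser as the XMLReader instead and feed it the raw HTML. The class and method names are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers character data and inserts line breaks.
public class SaxTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Entities such as &amp; arrive here already decoded by the parser
        text.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equalsIgnoreCase("br")) {
            text.append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("p") || qName.equalsIgnoreCase("tr")) {
            text.append('\n');
        }
    }

    // Parses well-formed XHTML from a string and returns the collected text.
    public static String extractText(String xhtml) {
        try {
            // JDK parser used here; swap in TagSoup's XMLReader for real-world HTML
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            SaxTextCollector handler = new SaxTextCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

Note that the JDK parser only knows the five predefined XML entities, so something like &nbsp; would fail here unless a DTD is supplied; handling such input is exactly the kind of thing TagSoup is for.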

Answer 5

I've used Apache Commons Lang to go the other way, but it looks like it can do what you need via StringEscapeUtils.

5 answers

solution1
19 ACCEPTED 2009-10-05 12:14:16

solution2
3 2009-10-05 07:37:12

solution3
2 2016-04-03 07:21:43

solution4
0 2009-10-05 07:57:16

solution5
-1 2013-02-26 18:41:39