
jsoup - Cleaning HTML with missing and broken tags

I am looking for a way to clean HTML text that may contain missing or broken tags. The text is usually written by non-programmers, so the HTML can have any number of problems. Here is what I've tried:

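import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

Parser p = Parser.htmlParser();
String test = "Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>";
Document d = p.parseInput(test, StringUtils.EMPTY);
System.out.println("BEFORE: " + test);
System.out.println("JSPARSED: " + StringUtils.remove(d.body().html(), "\n"));
System.out.println("JSOUP: " + Jsoup.clean(test, StringUtils.EMPTY, Whitelist.relaxed()));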

Output is:

BEFORE: Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>
JSPARSED: Here is a <i>fake message.<br><b><i>- Publisher</i></b></i>
JSOUP: Here is a 
<i>fake message.<br><b><i>- Publisher</i></b></i>

The desired output is:

Here is a <i>fake</i> message.<br><b><i>- Publisher</i></b>

Is it possible to clean the HTML for the above situations using jsoup?

EDIT: To add a bit more context, this HTML block is displayed on our website as the description for a product. It is usually written by the marketing team or the publisher and at times contains mistakes in the HTML. We currently use JTidy for HTML cleanup before displaying it on the website.

I recently ran a program to see how many products have an error in the description and found roughly 30,000 products with errors. After reviewing some of them, I saw that the majority of the errors are tags in the wrong order (which the program fixes), but errors where tags are missing or broken, as shown in the example, were not fixed as intended.

You are unlikely to ever get consistent results from auto-correcting 30k malformed HTML snippets. Chances are you will end up with even more screwed-up content.

Do yourself a favor:

  • Programmatically forbid saving broken HTML for new or edited descriptions.
  • Hire someone to correct the existing ones manually (or delegate that to the marketing team that introduced the errors in the first place).
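
To illustrate the first point, here is a minimal sketch of a save-time check that rejects a description when its tags do not balance. It is not a real HTML parser: the class name `TagBalanceCheck`, the method `findProblems`, and the short list of void elements are all made up for this example, and it assumes short snippets without `>` inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or JTidy.
final class TagBalanceCheck {

    // Elements that never take a closing tag (partial list, an assumption for this sketch).
    private static final Set<String> VOID_TAGS = Set.of("br", "hr", "img", "input", "meta", "link");

    // Matches anything tag-like: <name ...>, </name>, and broken forms such as </>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)?[^<>]*>");

    /** Returns a list of human-readable problems; an empty list means the tags balance. */
    static List<String> findProblems(String html) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String raw = m.group();
            if (raw.startsWith("<!")) {
                continue; // comments and doctype are ignored
            }
            String name = m.group(2);
            if (name == null) {
                problems.add("tag without a name at offset " + m.start() + ": " + raw);
                continue;
            }
            name = name.toLowerCase(Locale.ROOT);
            if (VOID_TAGS.contains(name) || raw.endsWith("/>")) {
                continue; // void or self-closing: nothing to balance
            }
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                open.push(name);
            } else if (open.isEmpty()) {
                problems.add("unexpected </" + name + "> at offset " + m.start());
            } else if (!open.peek().equals(name)) {
                problems.add("expected </" + open.peek() + "> but found </" + name
                        + "> at offset " + m.start());
            } else {
                open.pop();
            }
        }
        for (String name : open) {
            problems.add("unclosed <" + name + ">");
        }
        return problems;
    }

    public static void main(String[] args) {
        String broken = "Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>";
        // A save handler could refuse the edit and show these messages to the author.
        findProblems(broken).forEach(System.out::println);
    }
}
```

A save handler would call `findProblems` and refuse the edit whenever the list is non-empty, so the author fixes the markup while the intent is still fresh. If you would rather rely on a real parser, jsoup's `Parser` also has `setTrackErrors(int)` and `getErrors()`, which could serve the same gatekeeping purpose.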
