jsoup - Cleaning HTML with missing and broken tags

I am looking for a way to clean HTML text that may contain missing or broken tags. The text is usually written by non-programmers, so the HTML can have a number of problems. Here is what I've tried:

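import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

Parser p = Parser.htmlParser();
String test = "Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>";
Document d = p.parseInput(test, StringUtils.EMPTY);
System.out.println("BEFORE: " + test);
System.out.println("JSPARSED: " + StringUtils.remove(d.body().html(), "\n"));
System.out.println("JSOUP: " + Jsoup.clean(test, StringUtils.EMPTY, Whitelist.relaxed()));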

Output is:

BEFORE: Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>
JSPARSED: Here is a <i>fake message.<br><b><i>- Publisher</i></b></i>
JSOUP: Here is a 
<i>fake message.<br><b><i>- Publisher</i></b></i>

The desired output is:

Here is a <i>fake</i> message.<br><b><i>- Publisher</i></b>

Is it possible to clean the HTML in situations like the one above using jsoup?

EDIT: To add a bit more context, this HTML block is displayed on our website as the description of a product. It is usually written by the marketing team or the publisher and at times contains mistakes in the HTML. We currently use JTidy to clean up the HTML before displaying it on the website.

I recently ran a program to see how many products have an error in the description and found roughly 30,000 products with errors. After reviewing some of them, I saw that the majority of the errors are tags in the wrong order (which the program fixes), but errors where tags are missing or broken, as shown in the example, were not fixed as intended.

You are unlikely to ever get consistent results from autocorrecting 30,000 malformed HTML snippets. Chances are you will end up with even more broken content.

Do yourself a favor:

  • Programmatically reject broken HTML whenever a new or edited description is saved.
  • Hire someone to correct the existing errors manually (or delegate that to the marketing team that introduced the errors in the first place).
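
For the first point, a save-time check could look roughly like the sketch below. It is not jsoup code and not a real HTML validator: `TagBalanceCheck`, `isWellFormed`, and the short `VOID_TAGS` list are hypothetical names made up for illustration. It only verifies that every tag has a name and that opening and closing tags nest properly, and it assumes simple description markup (it would also reject HTML comments, doctype declarations, and attribute values containing `>`).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a rough tag-balance check, not a full HTML validator.
class TagBalanceCheck {
    // Void elements never take a closing tag, so they are skipped (assumed, incomplete list).
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "meta", "link");

    // Anything that looks like a tag: "<", optional "/", optional name, rest up to ">".
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)?[^<>]*>");

    static boolean isWellFormed(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (name == null) {
                return false; // nameless tag such as "</>"
            }
            name = name.toLowerCase();
            if (VOID_TAGS.contains(name)) {
                continue;
            }
            if (!closing) {
                open.push(name);
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false; // closing tag does not match the innermost open tag
            }
        }
        return open.isEmpty(); // anything left over was never closed
    }

    public static void main(String[] args) {
        String broken = "Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>";
        String fixed = "Here is a <i>fake</i> message.<br><b><i>- Publisher</i></b>";
        System.out.println(isWellFormed(broken)); // false
        System.out.println(isWellFormed(fixed));  // true
    }
}
```

A save handler would call `isWellFormed` and send the description back to its author when it returns false, instead of trying to guess the intended markup. If you would rather stay inside jsoup, its `Parser` can record parse errors via `setTrackErrors(int)` and `getErrors()`, which may serve the same purpose; check how it reports the kinds of mistakes you actually see before relying on it.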
