
Alternative to JSoup, or how to clean whitespace

Does somebody know an alternative to JSoup?

Or how to clean sequences like <p>&nbsp;</p>?

The HTML Clean plug-in for jQuery works well for me, but I'm interested in cleaning the HTML on the server side, not on the client side.

Or, what is the replaceAll expression to do this?

String cleanS = dirtyS.replaceAll("<p>&nbsp;</p>", ""); // This doesn't work

I have discovered that the dirty HTML comes with mixed sequences of blank characters: #160 (non-breaking space) and others such as #32 (regular space).

So what I need is an expression that removes any mixture of them.
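One possible shape for such an expression, as a minimal sketch using only java.util.regex: it treats regular whitespace, the raw character U+00A0, and the entities &nbsp; and &#160; as interchangeable blanks inside a paragraph. The class and method names are hypothetical, and the pattern assumes plain <p> tags without attributes; a regex over raw HTML is fragile in general.

```java
import java.util.regex.Pattern;

final class EmptyParagraphStripper {
    // Matches <p>...</p> whose body is empty or holds only whitespace,
    // the raw non-breaking space character, or the &nbsp; / &#160; entities.
    // Assumption: the <p> tags carry no attributes.
    private static final Pattern EMPTY_P = Pattern.compile(
            "<p>(?:\\s|\\u00A0|&nbsp;|&#160;)*</p>",
            Pattern.CASE_INSENSITIVE);

    static String stripEmptyParagraphs(String html) {
        return EMPTY_P.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "<p>keep</p><p>&nbsp;</p><p> \u00A0 </p><p></p>";
        System.out.println(stripEmptyParagraphs(dirty)); // prints "<p>keep</p>"
    }
}
```

Paragraphs that contain any other text, such as <p>a&nbsp;b</p>, are left untouched because the whole body must consist of blanks for the pattern to match.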


You can change the OutputSettings for this:

Example:

final String html = ...;


OutputSettings settings = new OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);

String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);

This is also possible with a Document parsed by Jsoup:

Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);

// ...

Edit:

Removing tags:

doc.select("p:matchesOwn((?is) )").remove();

Please note: the character after (?is) is not a regular blank but char #160 (= nbsp). This removes all p tags whose own text is only a &nbsp;. If you want to do the same for all other tags, replace p: with *:.

If you have the Document object, you can loop over the paragraph elements and remove all those that have no text (or only whitespace text) in them. Before checking whether the text is empty, you can strip the occurrences of &nbsp; from it. Assuming you're working with UTF-8 documents, the following might work for you:

public static final String NBSP_IN_UTF8 = "\u00a0"; 

Assuming you know how to get the Document object, the cleaning loop is simple: select the paragraph elements and remove the empty ones:

org.jsoup.nodes.Document doc = ...; // obtain your document object
for (org.jsoup.nodes.Element element : doc.select("p")) {
    // Treat a paragraph as empty if nothing remains after dropping NBSP and trimming
    if (!element.hasText() || element.text().replaceAll(NBSP_IN_UTF8, "").trim().equals("")) {
        element.remove();
    }
}
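The emptiness check in the loop can also be isolated into a small helper that does not depend on jsoup, which makes it easier to test on its own. This is a sketch under assumptions: the class and method names are hypothetical, and it scans characters instead of calling replaceAll and trim. The explicit NBSP comparison matters because neither String.trim() nor Character.isWhitespace() treats U+00A0 as whitespace.

```java
final class NbspText {
    static final char NBSP = '\u00A0';

    // Returns true when the text is null, empty, or made up only of
    // whitespace and non-breaking spaces.
    static boolean isBlankIgnoringNbsp(String text) {
        if (text == null) {
            return true;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != NBSP && !Character.isWhitespace(c)) {
                return false; // found a visible character
            }
        }
        return true;
    }
}
```

Inside the loop, the condition could then be written as NbspText.isBlankIgnoringNbsp(element.text()).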
