Alternative to JSoup, or how to clean whitespace
Does somebody know an alternative to JSoup?
Or how to clean sequences like <p> </p>?
The HTML Clean plug-in for jQuery works well for me, but I'm interested in doing the HTML cleaning on the server side, not on the client side.
Or, what is the replaceAll expression to do this?
String cleanS = dirtyS.replaceAll("<p> </p>", ""); //This doesnt work
I have discovered that the dirty HTML comes with mixed sequences of blank characters: #160 (non-breaking space) and others like #32 (ordinary space).
So what I need is an expression that removes any mixture of them.
You can change the OutputSettings (org.jsoup.nodes.Document.OutputSettings) for this.
Example:
final String html = ...;
OutputSettings settings = new OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);
String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);
This is also possible with a Document parsed by Jsoup:
Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
// ...
Edit:
Removing tags:
doc.select("p:matchesOwn((?is) )").remove();
Please note: after (?is) there is not a blank, but char #160 (= nbsp). This will remove all p tags whose own text is only a non-breaking space. If you want to do this with all other tags, you can replace the p: with *:.
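Because char #160 is invisible in source code, it is easy to lose when copying the selector. The following is a minimal sketch using only java.util.regex (no Jsoup) that shows why the distinction matters: by default, Java's \s does not match U+00A0, so the character has to be named explicitly. The class name NbspRegexDemo and the helper isBlank are hypothetical names made up for this illustration.

```java
import java.util.regex.Pattern;

public class NbspRegexDemo {
    // Matches text made up only of ordinary whitespace and/or U+00A0 (nbsp).
    // The escape is written with a doubled backslash so the regex engine,
    // not the Java compiler, interprets it.
    static final Pattern BLANK = Pattern.compile("^[\\s\\u00A0]*$");

    // Hypothetical helper: true if the text contains nothing visible.
    static boolean isBlank(String text) {
        return BLANK.matcher(text).find();
    }

    // Shows that plain \s alone does not cover the non-breaking space.
    static boolean matchesPlainWhitespace(String text) {
        return text.matches("\\s+");
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        System.out.println("plain \\s matches nbsp: " + matchesPlainWhitespace(nbsp));
        System.out.println("explicit class matches nbsp: " + isBlank(nbsp));
        System.out.println("mixed #32 and #160 is blank: " + isBlank(" " + nbsp + " "));
        System.out.println("real text is blank: " + isBlank("text"));
    }
}
```

The same explicit character class can be used wherever a regular expression is accepted, which avoids relying on an invisible character surviving a copy and paste.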
If you have the document object, you can loop over the paragraph elements and remove all those that have no text (or only whitespace) in them. Before checking whether the text is empty, you can strip the occurrences of NBSP. Assuming you're working with UTF-8 documents, the following might work for you:
public static final String NBSP_IN_UTF8 = "\u00a0";
Assuming you know how to get the Document object, the loop to clean is simple: select the paragraph elements and remove the empty ones:
org.jsoup.nodes.Document doc = ... // obtain your document object
for (org.jsoup.nodes.Element element : doc.select("p")) {
    // Remove paragraphs with no text, or with only whitespace/NBSP
    if (!element.hasText() || element.text().replaceAll(NBSP_IN_UTF8, "").trim().equals("")) {
        element.remove();
    }
}
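For the replaceAll route the question asks about, a string-level sketch is also possible. This is only an illustration under stated assumptions: the paragraphs are written as bare <p> tags without attributes, and the blanks appear as #32, other whitespace, the raw #160 character, or the entities &nbsp; and &#160;. The class name EmptyParagraphCleaner and the method removeEmptyParagraphs are hypothetical. Regular expressions are brittle against arbitrary HTML, so the parser-based loop above remains the safer choice.

```java
public class EmptyParagraphCleaner {
    // <p>, then any mixture of whitespace, U+00A0 and its entity forms, then </p>.
    // (?i) makes the tag and entity names case-insensitive.
    private static final String EMPTY_P =
            "(?i)<p>(?:\\s|\\u00A0|&nbsp;|&#160;)*</p>";

    // Hypothetical helper: drops paragraphs that contain only blanks.
    static String removeEmptyParagraphs(String dirtyHtml) {
        return dirtyHtml.replaceAll(EMPTY_P, "");
    }

    public static void main(String[] args) {
        String dirty = "<p>keep me</p><p> \u00A0 </p><p>&nbsp;</p>";
        System.out.println(removeEmptyParagraphs(dirty));
    }
}
```

The original attempt failed because the literal pattern "<p> </p>" only matches a single #32 between the tags, never a #160 or a mixture of the two.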