Keep attributes with certain value when cleaning html with Jsoup

I'm using this code to clean messy HTML exported from Word, stripping essentially everything from it. I want to keep only text-formatting tags and text alignment.

[...]
String result = null;
Document html = Jsoup.parse(rawHtml, "/");
html.select("span").unwrap();
Whitelist wl = Whitelist.simpleText();
wl.addTags("div", "span", "p");
wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = Jsoup.clean(html.body().html(), wl);
return result;

private void editStyle(Document html, String selector, String attrKey, String attrVal) {
    Elements values = html.select(selector);
    values.attr(attrKey, attrVal);
}
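
The two editStyle calls above hardcode one selector per alignment value. As a minimal sketch, assuming nothing beyond the Java standard library, the selector/style pairs could be derived from a list of alignment values instead. The class name AlignRules and the method selectorToStyle are hypothetical and not part of Jsoup:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Jsoup API): builds the selector -> style value
// pairs that the question passes to editStyle by hand.
final class AlignRules {

    // For each alignment value, maps "[align='<v>']" to "text-align: <v>".
    // Values are trimmed and lower-cased; insertion order is preserved.
    static Map<String, String> selectorToStyle(List<String> alignments) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String alignment : alignments) {
            String value = alignment.trim().toLowerCase(Locale.ROOT);
            rules.put("[align='" + value + "']", "text-align: " + value);
        }
        return rules;
    }

    public static void main(String[] args) {
        // Print each generated selector together with its style value.
        selectorToStyle(List.of("center", "justify"))
                .forEach((selector, style) -> System.out.println(selector + " -> " + style));
    }
}
```

Each entry could then be passed to the existing method as editStyle(html, selector, "style", style), so supporting another alignment value only means adding it to the list.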

I know it's redundant to have both the align and the style attributes, but I'm keeping both only for testing purposes; once I'm able to fix this, I'll remove the align attribute as well.

This of course doesn't keep the style attributes I add whenever I encounter an align attribute, because style is not in the whitelist. What I want to achieve is to use clean to remove everything except style attributes containing exclusively a text-align value (that is, it should remove any other style attribute, even one that contains text-align together with something else).
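
To make the rule "exclusively a text-align value" concrete, here is a minimal sketch of such a check using only the Java standard library. The class name TextAlignStyles, the method isTextAlignOnly, and the set of accepted alignment values are assumptions made for illustration; none of this is part of Jsoup:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Jsoup API): decides whether a style attribute
// value consists of exactly one declaration, and that declaration is text-align.
final class TextAlignStyles {

    // Assumed set of acceptable alignment values; adjust as needed.
    private static final Set<String> ALLOWED = Set.of("left", "right", "center", "justify");

    static boolean isTextAlignOnly(String style) {
        if (style == null) {
            return false;
        }
        String declaration = null;
        for (String part : style.split(";")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty pieces such as a trailing ';'
            }
            if (declaration != null) {
                return false; // more than one declaration: reject
            }
            declaration = trimmed;
        }
        if (declaration == null) {
            return false; // empty style value
        }
        int colon = declaration.indexOf(':');
        if (colon < 0) {
            return false; // not a "property: value" pair
        }
        String property = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        String value = declaration.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
        return property.equals("text-align") && ALLOWED.contains(value);
    }

    public static void main(String[] args) {
        System.out.println(isTextAlignOnly("text-align: center"));
        System.out.println(isTextAlignOnly("text-align: center; color: red"));
    }
}
```

A check like this only decides which style values qualify; it does not by itself change what Jsoup.clean keeps.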

I know that it works if I change the last part like this:

wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
result = Jsoup.clean(html.body().html(), wl);
html = Jsoup.parse(result, "/");
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = html.html();

I take the HTML returned by clean, parse it again with Jsoup, and add the style attributes by calling editStyle at this point rather than before cleaning.

But I wanted to know if there's some way to do it in only one step.

Since there was no answer to this, I'm guessing it is not possible, so I just parse the document again after cleaning, as per the alternative solution I already posted in the question:

wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
result = Jsoup.clean(html.body().html(), wl);
html = Jsoup.parse(result, "/");
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = html.html();

