
Jsoup attribute removal on html tags

I want to filter certain texts which may contain HTML. I use jsoup to whitelist and clean the tags, which works pretty nicely.

My only problem is that some of the tags can contain attributes, mostly style or class, but there could also be others (name, target, etc.). When cleaning, this is no problem because they get stripped nicely, but when whitelisting, some tags which would be allowed get blocked because of their attributes. The basic whitelist does not seem to cover style or class attributes, and I cannot be sure what else I will encounter.

Since I want to allow quite a wide range of tags, but remove most of them during cleaning, I don't want to add all attributes for all tags that I'm allowing. The simplest approach would be to strip all attributes from all tags, since I'm not interested in them anyway, and then check whether the stripped text with the plain tags is valid.

Is there a function that removes all attributes, or some simple loop that does it? Another option would be to tell the whitelist to ignore all attributes and simply whitelist on the tags.

The solution that finally worked for me is quite simple. I iterate through all elements, then through all attributes of each element, and remove them from the element. That leaves me with a cleaned version where I only have to validate the HTML tags themselves. I don't think this is the neatest way to solve the problem, but it does what I wanted.

** EDIT **

I got upvoted many times for the old code even though it contained an absolute beginner's bug: you should never remove items from a collection while iterating over that same collection. The bug only triggered when more than one attribute was removed, however.
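To illustrate the failure mode described above without depending on jsoup, here is a minimal sketch that uses a plain LinkedHashMap as a stand-in for one element's attributes. The class and method names are hypothetical, and the map is only an assumption for illustration; jsoup's internal attribute storage may behave differently. It contrasts removing while iterating with the collect-first-then-remove pattern.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a LinkedHashMap plays the role of one element's attributes.
class AttributeRemovalSketch {

    // Builds a sample "attribute" map with three entries (made-up values).
    static Map<String, String> sampleAttributes() {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "intro");
        attrs.put("style", "color:red");
        attrs.put("target", "_blank");
        return attrs;
    }

    // Buggy pattern: removes entries while the for-each loop is still iterating.
    // Returns true if the loop was aborted by a ConcurrentModificationException.
    static boolean removeWhileIteratingFails(Map<String, String> attrs) {
        try {
            for (String key : attrs.keySet()) {
                attrs.remove(key);
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Safe pattern: copy the keys into a separate list first, then remove.
    static void removeAllSafely(Map<String, String> attrs) {
        List<String> keysToRemove = new ArrayList<>(attrs.keySet());
        for (String key : keysToRemove) {
            attrs.remove(key);
        }
    }

    public static void main(String[] args) {
        Map<String, String> buggy = sampleAttributes();
        System.out.println("remove-while-iterating failed: " + removeWhileIteratingFails(buggy));
        System.out.println("entries left behind: " + buggy.size());

        Map<String, String> safe = sampleAttributes();
        removeAllSafely(safe);
        System.out.println("entries left after safe removal: " + safe.size());
    }
}
```

With this stand-in, a map holding a single entry happens to get through the buggy loop without an exception, which matches the observation that the bug only showed up once more than one attribute had to be removed.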

Updated code with the bug fix:

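import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Inside a method that receives aText (the input) and theLegalWhitelist:
Document doc = Jsoup.parseBodyFragment(aText);
Elements el = doc.getAllElements();
for (Element e : el) {
    // Collect the attribute keys in a separate list first, so that nothing
    // is removed while e.attributes() is still being iterated. This makes
    // sure ALL attributes (including data-attributes) get removed.
    List<String> attToRemove = new ArrayList<>();
    Attributes at = e.attributes();
    for (Attribute a : at) {
        attToRemove.add(a.getKey());
    }

    // Now remove every collected attribute from the element.
    for (String att : attToRemove) {
        e.removeAttr(att);
    }
}

// Validate the stripped HTML against the tag whitelist.
return Jsoup.isValid(doc.body().html(), theLegalWhitelist);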

