Java: remove < and > from text in XML (not tags)

I'm having a hard time escaping XML so that it can be processed by Java. I'm using JTidy to escape unwanted characters, but I'm struggling to remove "<" and ">" from text values such as <tag> capacity < 1000 </tag>

I'm using the code below to escape the input:

    public String CleanXML(String input){

        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF-16");
        tidy.setOutputEncoding("UTF-16");
        tidy.setWraplen(Integer.MAX_VALUE);
        tidy.setXmlOut(true);
        tidy.setSmartIndent(true);
        tidy.setXmlTags(true);
        tidy.setMakeClean(true);
        tidy.setForceOutput(true);
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        StringReader in = new StringReader(input);
        StringWriter out = new StringWriter();
        tidy.parse(in, out);

        return out.toString();
    }

Use the following function:

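    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Captures everything between <tag> and </tag>, including line breaks.
    private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

    public String CleanXML(String input){
        final Matcher matcher = TAG_REGEX.matcher(input);
        while (matcher.find()) {
            String value = matcher.group(1);
            // Drop everything except letters, digits and whitespace.
            String valueReplace = value.replaceAll("[^a-zA-Z0-9\\s]", "");
            // Strings are immutable, so the result of replace() must be assigned back.
            input = input.replace(value, valueReplace);
        }
        return input;
    }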

It uses a regular expression search to get the values between the tags, then removes all non-alphanumeric characters from them. The regular expression and the basic idea come from "Java regex to extract text between tags".
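If only "<" and ">" should go, and the tag name is not always "tag", a narrower variant of the same regex idea might look like the sketch below. It assumes elements without attributes, and it treats a "<" as text only when it is not directly followed by a letter or "/"; otherwise it cannot be told apart from a real tag. The class name `TextBracketStripper` and the method `stripBrackets` are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextBracketStripper {
    // <name>text</name>, where text must not contain anything that looks like
    // the start of a tag ("<" or "</" followed by a letter).
    // The back-reference \1 forces the closing tag to match the opening tag.
    private static final Pattern LEAF_ELEMENT =
            Pattern.compile("<(\\w+)>((?:(?!</?[A-Za-z]).)*)</\\1>", Pattern.DOTALL);

    public static String stripBrackets(String input) {
        Matcher matcher = LEAF_ELEMENT.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Remove only the angle brackets from the text; keep everything else.
            String text = matcher.group(2).replaceAll("[<>]", "");
            String element = "<" + name + ">" + text + "</" + name + ">";
            // quoteReplacement keeps "$" and "\" in the text from being interpreted.
            matcher.appendReplacement(result, Matcher.quoteReplacement(element));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripBrackets("<tag> capacity < 1000 </tag>"));
    }
}
```

Because the replacement is built per match with `appendReplacement`, only the matched element is rewritten and identical text elsewhere in the input is left alone. If the meaning of the value matters, replacing "<" with `&lt;` and ">" with `&gt;` instead of deleting them would keep the comparison readable after parsing.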

If you want to remove the tag delimiters from the XML, just convert it to a map and build the string you need from that; see "XML to map in Java".

If you want to clean attribute values, you can iterate over the map and clean them, then build a string or convert the map back to XML; see "map to XML in Java".
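The map round trip could be sketched roughly as follows, using only the JDK's DOM parser. This is an illustration under several assumptions: the input is already well-formed (for example "<" arrives as `&lt;`), the root has only simple child elements with unique names, and it cleans element text rather than attributes. `XmlMapSketch`, `toMap` and `toXml` are hypothetical names, not taken from the linked posts.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlMapSketch {
    // Reads the direct children of the root element into tag name -> text.
    public static Map<String, String> toMap(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> map = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // Repeated tag names overwrite each other (assumed unique here).
                    map.put(child.getNodeName(), child.getTextContent());
                }
            }
            return map;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Cleans each value and rebuilds a flat XML string under the given root name.
    public static String toXml(String rootName, Map<String, String> map) {
        StringBuilder sb = new StringBuilder("<" + rootName + ">");
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String cleaned = entry.getValue()
                    .replaceAll("[<>]", "")   // drop angle brackets from the text
                    .replace("&", "&amp;");   // re-escape ampersands for output
            sb.append('<').append(entry.getKey()).append('>')
              .append(cleaned)
              .append("</").append(entry.getKey()).append('>');
        }
        return sb.append("</").append(rootName).append('>').toString();
    }
}
```

A call such as `XmlMapSketch.toXml("root", XmlMapSketch.toMap(xml))` would then give the cleaned document. Note that the parser rejects a raw "<" inside text, so this route only applies once the input parses at all.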
