Remove HTML from only part of a string

I have the following code, which should remove all HTML from the parts of a string that are delimited by dollar signs (there may be several such parts). This works, but I also need to preserve the dollar signs themselves. Any suggestions?

private static String removeMarkupBetweenDollars(String input){
    if ((input.length()-input.replaceAll("\\$","").length())%2!=0)
    {
        throw new RuntimeException("Missing or extra: dollar");
    }
    Pattern pattern = Pattern.compile("\\$(.*?)\\$",Pattern.DOTALL);
    Matcher matcher = pattern.matcher(input);

    StringBuffer sb =new StringBuffer();

    while (matcher.find()) {
        // prepending does NOT work if something precedes the first dollar
        matcher.appendReplacement(sb,matcher.group(1).replaceAll("\\<.*?\\>", ""));
        sb.append("$"); //note this manual appending
    }
    matcher.appendTail(sb);
    System.out.println(sb.toString());

    return sb.toString();
}

Thanks for help!

Example input and expected output:

String input = "<p>$<em>something</em>$</p>  <p>anything else</p>";
String output = "<p>$something$</p>  <p>anything else</p>";

More complicated input and output:

String input="<p>$ bar  <b>foo</b>  bar <span style=\"text-decoration: underline;\">foo</span>  $</p><p>another foos</p> $ foo bar <em>bar</em>$";
String output="<p>$ bar  foo  bar foo  $</p><p>another foos</p> $ foo bar bar$";

Just some minor tweaks to your code:

private static String removeMarkupBetweenDollars(String input) {
    if ((input.length() - input.replaceAll("\\$", "").length()) % 2 != 0) {
        throw new RuntimeException("Missing or extra: dollar");
    }

    Pattern pattern = Pattern.compile("\\$(.*?)\\$", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(input);

    StringBuffer sb = new StringBuffer();

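    while (matcher.find()) {
        // group() is the whole match, including both dollar signs, so they are preserved
        String s = matcher.group().replaceAll("<[^>]+>", "");
        // quoteReplacement stops the literal $ from being read as a group reference
        matcher.appendReplacement(sb, Matcher.quoteReplacement(s));
    }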
    matcher.appendTail(sb);

    return sb.toString();
}
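
For comparison, here is a minimal sketch of the same idea using Matcher.replaceAll with a replacement function, which assumes Java 9 or later. The class name DollarMarkup and the method name strip are made up for illustration and are not part of the answer above. As in the answer, the whole match (dollars included) is cleaned of tags and then quoted, so the dollar signs survive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, used only for this sketch.
public class DollarMarkup {
    // Shortest possible span from one dollar sign to the next; DOTALL lets it cross line breaks.
    private static final Pattern DOLLAR_SPAN = Pattern.compile("\\$.*?\\$", Pattern.DOTALL);

    public static String strip(String input) {
        // Matcher.replaceAll(Function) requires Java 9+. The returned string is still
        // treated as a replacement string, so it must be quoted to keep the literal $.
        return DOLLAR_SPAN.matcher(input).replaceAll(
            m -> Matcher.quoteReplacement(m.group().replaceAll("<[^>]+>", "")));
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>$<em>something</em>$</p>  <p>anything else</p>"));
    }
}
```

This sketch leaves out the odd-number-of-dollars check; with an unpaired trailing dollar sign, that last part is simply left untouched.
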
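
Alternatively, when each dollar-delimited part is wrapped in exactly one pair of tags (as in the simple example), a single replaceAll call is enough:

String output = input.replaceAll("\\$<.*?>(.*?)<.*?>\\$", "\\$$1\\$");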

One key point in the regex is the ? in .*?: it makes the match "non-greedy" (reluctant), which means "consume as little input as possible". Without it, the regex would consume as much as possible, running up to the end of a later occurrence of $<html>foo</html>$ in the input if one existed.
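
The difference can be seen in a small side-by-side sketch. The class name GreedyVsReluctant, the method names, and the sample input are assumptions chosen for illustration; only the reluctant pattern comes from this answer.

```java
// Hypothetical names, used only to contrast the two quantifier styles.
public class GreedyVsReluctant {
    // Pattern from the answer: every .*? stops at the first position that lets the match succeed.
    public static String reluctant(String input) {
        return input.replaceAll("\\$<.*?>(.*?)<.*?>\\$", "\\$$1\\$");
    }

    // Same pattern with greedy quantifiers: the first .* runs as far right as it can.
    public static String greedy(String input) {
        return input.replaceAll("\\$<.*>(.*)<.*>\\$", "\\$$1\\$");
    }

    public static void main(String[] args) {
        String input = "$<em>a</em>$ and $<em>b</em>$";
        System.out.println(reluctant(input)); // $a$ and $b$
        System.out.println(greedy(input));    // $b$
    }
}
```

With the greedy version, both dollar-delimited parts are swallowed by a single match, so everything between the first and the last dollar sign is lost except the final captured text.
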

Here's a test:

public static void main(String[] args) throws Exception {
    String input = "<p>$<em>something</em>$</p> <p>and $<em>anything</em>$ else</p>";
    String output = input.replaceAll("\\$<.*?>(.*?)<.*?>\\$", "\\$$1\\$");
    System.out.println(output);
}

Output:

<p>$something$</p> <p>and $anything$ else</p>
