
Java Regex Group Replacement Without Matching Group Offset Manipulation

I often face requirements like removing <p></p> tags from within an XHTML document, for a very specific type of subsequence (one that disallows the use of String.replaceAll()). Typically it's of the pattern <p>${randomTextAndHTMLorJavascript}</p>, but the one constant is that it's always one arbitrary tag with lots of crap inside, followed by its ending tag. No tag nesting!

My question is whether anyone is aware of a higher-level abstraction than manually manipulating the Matcher object. In the past I've done these kinds of replacements in the following ways:

  1. Treating the problem as an array copy: I use a StringBuilder and the Matcher.start(int) and Matcher.end(int) methods to NOT copy the target group(s) from the input String. This works, but feels like C, not Java.

  2. Looping: I use the start token to find the first tag, take the result of match1.group() as input for a second Matcher that captures the end tag, and then use Matcher.replaceFirst() to handle the replacement in the input string itself. The drawback is that I need to call Matcher.reset(), which forces a reparse. (I only use this for throwaway scripts or if the input set is guaranteed to be tiny.)

  3. Calling String.split() on one tag, replacing it except when matched by match1, and reconstructing the string with a StringBuilder. I run the second Matcher against a token representing the end tag's sequence and do a String.replaceAll() before appending.

  4. Using the StringBuilder.deleteCharAt() method, but that still feels too low-level for a language like Java.
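For reference, approach 1 might look roughly like the sketch below. The class name, the method name dropGroup, and the sample pattern are made up for illustration; the idea is only that everything outside the target group's start/end offsets gets copied.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetCopySketch {
    // Hypothetical helper: copies the input, skipping the span covered by
    // the given capturing group in every match of the pattern.
    static String dropGroup(String input, Pattern pattern, int group) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher m = pattern.matcher(input);
        int last = 0; // end offset of the most recently skipped span
        while (m.find()) {
            if (m.start(group) < 0) {
                continue; // the group did not take part in this match
            }
            out.append(input, last, m.start(group)); // text before the group
            last = m.end(group);                     // resume after the group
        }
        out.append(input, last, input.length());     // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample pattern: one non-nested <p>...</p> pair, body in group 1.
        Pattern p = Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);
        System.out.println(dropGroup("a<p>junk</p>b<p>more</p>c", p, 1));
        // prints a<p></p>b<p></p>c
    }
}
```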

What would be ideal would be a method signature like this:

Matcher.replaceGroup(int targetGroup, String pattern, String replacement);

Ultimately I'm hoping to replace a regex matching group in Java without needing to work with group/array offsets.
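To make the wish concrete: no such method exists on Matcher, but a static helper with roughly that shape could be sketched as below. Everything here (the class, the name replaceGroup, the parameter order) is hypothetical, and the helper still uses offsets internally; it only hides them from the caller.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceGroupSketch {
    // Hypothetical helper, not part of java.util.regex: within every match of
    // 'outer', applies innerRegex -> replacement to the text of targetGroup only.
    static String replaceGroup(String input, Pattern outer, int targetGroup,
                               String innerRegex, String replacement) {
        Matcher m = outer.matcher(input);
        StringBuffer out = new StringBuffer(); // StringBuffer works on all Java versions
        while (m.find()) {
            String whole = m.group();
            String rebuilt = whole;
            if (m.start(targetGroup) >= 0) {
                // Offsets of the group relative to the start of the whole match.
                int from = m.start(targetGroup) - m.start();
                int to = m.end(targetGroup) - m.start();
                // 'replacement' follows the usual replacement-string syntax ($1, \$).
                String edited = m.group(targetGroup).replaceAll(innerRegex, replacement);
                rebuilt = whole.substring(0, from) + edited + whole.substring(to);
            }
            // quoteReplacement keeps '$' and '\' in the rebuilt text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern outer = Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);
        System.out.println(
            replaceGroup("id 7 <p>a1 b22</p> id 8", outer, 1, "\\d+", "#"));
        // prints id 7 <p>a# b#</p> id 8
    }
}
```

Digits outside the <p> element are left alone because the inner replacement only ever sees the text of group 1.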

For XHTML (or other XML) documents, one (much) higher-level abstraction would be an XSL transform. They are far more expressive and powerful than regex, and they can work even if you do have internal structure to contend with.
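As a rough illustration of the XSL route from Java, the sketch below runs an identity transform plus one extra template that unwraps p elements, using the JDK's built-in javax.xml.transform API. The class name, method name, stylesheet, and sample input are all assumptions made for the example. Note that real XHTML elements live in the XHTML namespace, so the match pattern would need a namespace prefix there; the sample input is deliberately namespace-free.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltUnwrapSketch {
    // Identity transform, plus a template that drops <p> tags but keeps their content.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='p'><xsl:apply-templates/></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical helper: the input must be well-formed XML.
    static String unwrapParagraphs(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(unwrapParagraphs("<div>before <p>kept text</p> after</div>"));
    }
}
```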

Or if you want to keep the logic closer to Java, then why not use backreferences in the replacement string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pat = Pattern.compile("(<p>keep )(stuff I don't want)( this</p>)");
Matcher m = pat.matcher(input);  // input is the String to be cleaned

// Replace matches to the pattern with the same thing less "stuff I don't want":
String output = m.replaceAll("$1$3");

I know you said you can't use replaceAll(), but it's not clear to me why you could not accomplish exactly what you describe in your (1) (for example) via this approach.

Of course, backreferences work with Matcher.replaceFirst(), String.replaceAll() and String.replaceFirst(), too.
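A small sketch of the String-level variants, assuming the goal is to drop a non-nested <p>...</p> wrapper while keeping its content. The class and method names and the regex are illustrative, not taken from the question.

```java
class BackrefSketch {
    // (?s) lets '.' match newlines; '.*?' is reluctant, so each match
    // stops at the nearest </p> (this relies on the tags not being nested).
    static final String P_ELEMENT = "(?s)<p>(.*?)</p>";

    // Replaces every <p>...</p> with just its content ($1).
    static String unwrapAll(String input) {
        return input.replaceAll(P_ELEMENT, "$1");
    }

    // Same, but only for the first occurrence.
    static String unwrapFirst(String input) {
        return input.replaceFirst(P_ELEMENT, "$1");
    }

    public static void main(String[] args) {
        String s = "x<p>one</p>y<p>two</p>z";
        System.out.println(unwrapAll(s));   // prints xoneytwoz
        System.out.println(unwrapFirst(s)); // prints xoney<p>two</p>z
    }
}
```

The String methods recompile the pattern on every call, so for repeated use a precompiled Pattern with a Matcher is usually the better fit.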

Edited to add:

If you want to step through iteratively, to do something more with the matches as they go by, then you should look into Matcher.appendReplacement() (with which you can also use backreferences) and Matcher.appendTail().
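A minimal sketch of that iterative style, assuming you want to number each match as it goes by while keeping its content through a backreference. The class name, method name, and output format are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementSketch {
    static String numberParagraphs(String input) {
        Matcher m = Pattern.compile("(?s)<p>(.*?)</p>").matcher(input);
        StringBuffer out = new StringBuffer();
        int n = 0;
        while (m.find()) {
            n++;
            // appendReplacement copies the text between the previous match and
            // this one, then the replacement; "$1" refers to this match's group 1.
            // Text computed at runtime should go through Matcher.quoteReplacement().
            m.appendReplacement(out, "[" + n + ":$1]");
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberParagraphs("x<p>one</p>y<p>two</p>z"));
        // prints x[1:one]y[2:two]z
    }
}
```

No start or end offsets appear in the calling code; the Matcher tracks the append position itself.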
