Using Unicode regular expressions in Java to match any Unicode character

I am trying to use the Java regex matcher to search and replace. However, after it failed to match a certain string, I noticed that the expression ".*" seems to fail to match certain Unicode characters (in my case it was a U+2028 LINE SEPARATOR character).

This is what I have at the moment (match an XML element with any text in between):

String segSourceSearch = "<source(.?)>(.*?)</source>";
String segSourceReplace = "<source$1>$2</source><target$1>$2</target>";
myString = myString.replaceAll(segSourceSearch, segSourceReplace);

Basically, this is supposed to duplicate the element. But how can I modify the regex (.*?) to match any Unicode character between <source> and </source>? Is there a built-in pattern in Java? If not, is there anything in ICU4J that I could use? (I haven't been able to find a regex matcher in ICU4J.)

The Javadoc for Pattern.DOTALL says:

Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s).
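The difference between the default mode and dotall mode can be checked with a small sketch like the one below. The class name, method names, and sample string are made up for illustration; only Pattern.compile, Pattern.DOTALL, and the (?s) flag come from the Java API.

```java
import java.util.regex.Pattern;

class DotallFlagDemo {
    // Hypothetical sample text: two words separated by U+2028 LINE SEPARATOR.
    static final String SAMPLE = "first\u2028second";

    // Default mode: '.' does not match line terminators such as U+2028.
    static boolean matchesByDefault(String s) {
        return Pattern.compile(".*").matcher(s).matches();
    }

    // Dotall mode enabled through the flag argument of Pattern.compile.
    static boolean matchesWithFlag(String s) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(s).matches();
    }

    // Dotall mode enabled through the embedded flag expression (?s).
    static boolean matchesWithEmbeddedFlag(String s) {
        return Pattern.compile("(?s).*").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesByDefault(SAMPLE));        // false
        System.out.println(matchesWithFlag(SAMPLE));         // true
        System.out.println(matchesWithEmbeddedFlag(SAMPLE)); // true
    }
}
```

The embedded form is the one that works with String.replaceAll, because that method takes the regex as a plain string and has no parameter for flags.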

So the pattern you are looking for is (?s).*?. To capture the match you still have to enclose it in parentheses, ((?s).*?), but you can also place (?s) at the beginning of the entire expression to enable DOTALL mode for the whole regex.
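Applied to the code from the question, that could look roughly like the following sketch. The search and replacement strings are the ones from the question with (?s) prefixed; the class name, the duplicate helper, and the sample input are assumptions added for illustration.

```java
class DuplicateSourceElement {
    // The question's pattern, with (?s) prefixed so that '.' also matches
    // line terminators (including U+2028) everywhere in the expression.
    static final String SEARCH = "(?s)<source(.?)>(.*?)</source>";
    static final String REPLACE = "<source$1>$2</source><target$1>$2</target>";

    // Hypothetical helper: copies each <source> element into a <target> element.
    static String duplicate(String xml) {
        return xml.replaceAll(SEARCH, REPLACE);
    }

    public static void main(String[] args) {
        // Assumed sample input containing a U+2028 LINE SEPARATOR in the text.
        String input = "<source>line one\u2028line two</source>";
        System.out.println(duplicate(input));
    }
}
```

Without the (?s) prefix the same input comes back unchanged, because (.*?) cannot cross the U+2028 character and the pattern never reaches </source>.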

