
Java Replace Unicode Characters in a String

I have a string that contains multiple Unicode characters. I want to identify each of them, e.g. \uF06C, and replace it with a backslash followed by its four hex digits, without the "u".

Example:

Source String: "add \uF06Cd1 Clause"

Result String: "add \F06Cd1 Clause"

How can I achieve this in Java?

Edit:

The linked question, "Java Regex - How to replace a pattern or how to", is different from this one because my question deals with Unicode characters. Although the escape is written with several literal characters, the JVM treats it as one single character, so a regex won't work.
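The distinction the edit draws can be checked directly. The following is only a sketch with a made-up class name (EscapeVersusCharacter): it compares a string holding the actual character U+F06C with a string holding the six-character escape text. Which of the two the input really contains decides whether a regex on the escape text or a loop over the characters is the right tool.

```java
class EscapeVersusCharacter {
    // The compiler turns this escape into one UTF-16 code unit (U+F06C).
    static final String ACTUAL_CHARACTER = "\uF06C";
    // Doubling the backslash keeps six literal characters: backslash, u, F, 0, 6, C.
    static final String LITERAL_TEXT = "\\uF06C";

    public static void main(String[] args) {
        System.out.println(ACTUAL_CHARACTER.length());              // prints 1
        System.out.println(LITERAL_TEXT.length());                  // prints 6
        System.out.println(ACTUAL_CHARACTER.charAt(0) == 0xF06C);   // prints true
    }
}
```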

The correct way to do this is to use a regex that matches the entire Unicode escape and then apply a group replacement.

The regex to match the Unicode escape:

A Unicode escape looks like \uABCD, that is \u followed by a four-digit hexadecimal number. Matching these can be done using

\\u[A-Fa-f\d]{4}

But there's a problem with this:
In a string like "just some \\uABCD arbitrary text" the \u would still get matched, even though its backslash is itself escaped. So we need to make sure the \u is preceded by an even number of backslashes:

(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
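A quick way to see what the lookbehind buys is to run the pattern against one, two and three leading backslashes. This is a minimal sketch; the class and method names (EscapedBackslashDemo, containsUnicodeEscape) are invented for illustration.

```java
import java.util.regex.Pattern;

class EscapedBackslashDemo {
    // Same regex as above, written as a Java string literal
    // (every regex backslash is doubled once more for Java).
    static final Pattern UNICODE_ESCAPE =
            Pattern.compile("(?<!\\\\)(\\\\\\\\)*\\\\u[A-Fa-f\\d]{4}");

    static boolean containsUnicodeEscape(String text) {
        return UNICODE_ESCAPE.matcher(text).find();
    }

    public static void main(String[] args) {
        String one = "just some \\uABCD arbitrary text";        // one backslash in the text
        String two = "just some \\\\uABCD arbitrary text";      // two backslashes: an escaped backslash
        String three = "just some \\\\\\uABCD arbitrary text";  // three: escaped backslash plus escape

        System.out.println(containsUnicodeEscape(one));    // prints true
        System.out.println(containsUnicodeEscape(two));    // prints false
        System.out.println(containsUnicodeEscape(three));  // prints true
    }
}
```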

Now as output we want a backslash followed by the hex part. This can be done with a group replacement, so let's start by grouping the pieces. A repeated group such as (\\\\)* keeps only its last repetition, so the whole run of backslash pairs goes into one outer group:

(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})

As the replacement we want all backslashes captured by the first group, followed by a backslash and the hex part of the Unicode escape:

$1\\$3

Now for the actual code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// test is the input string that contains the escape text
String pattern = "(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";

Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);

That's a lot of backslashes! The reason is that a backslash has to be escaped twice: once for the Java string literal and once for the regex. So "\\\\" as a pattern string in Java matches a single \ in the input.
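To try the replacement end to end, the snippet can be wrapped in a small class. This is a sketch rather than the answerer's exact code: the names UnicodeEscapeRewriter and dropU are made up, and the pattern puts an outer capturing group around the repeated backslash pairs, because a repeated capturing group only remembers its last repetition and longer runs of backslashes would otherwise be shortened.

```java
import java.util.regex.Pattern;

class UnicodeEscapeRewriter {
    // Group 1: the whole run of escaped backslashes (possibly empty).
    // Group 2: the backslash-u marker. Group 3: the four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})");

    // Rewrites each escape as: preceding backslashes, one backslash, hex digits.
    static String dropU(String text) {
        return ESCAPE.matcher(text).replaceAll("$1\\\\$3");
    }

    public static void main(String[] args) {
        // The Java literal below holds the escape as text, not as a character.
        System.out.println(dropU("add \\uF06Cd1 Clause")); // prints add \F06Cd1 Clause
    }
}
```

An input whose backslash is itself escaped (an even number of backslashes before the u) is left unchanged.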

EDIT:
On actual strings, where the characters are already decoded rather than written out as escape text, the non-ASCII characters need to be picked out and replaced by their hexadecimal code:

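// in is the input string
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127)
        // non-ASCII: backslash plus four uppercase hex digits, as in the question
        sb.append("\\").append(String.format("%04X", (int) c));
    else
        sb.append(c);
}
String result = sb.toString();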

This assumes that by "Unicode character" you mean a non-ASCII character. The code outputs every ASCII character as is and every other character as a backslash followed by its four-digit hexadecimal code. The term "Unicode character" is rather vague, though, since a char in Java always represents a Unicode (UTF-16) code unit. This approach preserves control characters such as "\n" and "\r", which is why I chose it over other definitions.
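The loop can be packaged as a method and run against the example from the question. This is a minimal sketch with invented names (NonAsciiEscaper, escapeNonAscii), and it prints uppercase hex digits to match the expected result in the question. Because it walks UTF-16 code units, a character outside the Basic Multilingual Plane would come out as two backslash groups, one per surrogate.

```java
class NonAsciiEscaper {
    // Copies ASCII characters unchanged and writes every other UTF-16 code unit
    // as a backslash followed by four uppercase hex digits.
    static String escapeNonAscii(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (char c : in.toCharArray()) {
            if (c > 127) {
                sb.append('\\').append(String.format("%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the input from the code point so the source file stays pure ASCII.
        String source = "add " + (char) 0xF06C + "d1 Clause";
        System.out.println(escapeNonAscii(source)); // prints add \F06Cd1 Clause
    }
}
```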

Try using the String.replaceAll() method:

s = s.replaceAll("\\\\u", "\\\\");
