How to print string with unicode characters missing backslashes?

I have a string as follows:

this is the string u00c5 with missing slash before unicode characters

It contains Unicode character codes, but the backslash before each "u" is missing. How can I print this string correctly?

What have I tried?

I tried to add a backslash before each incomplete Unicode escape using the following code. However, "\u$1" is not allowed in replaceAll.

public String sanitizeUnicodeQuirk(String input) {
    try {
        // String processedInput = input.replaceAll("[uU]([0123456789abcdefABCDEF]{4})", String.valueOf(Integer.parseInt("$1", 16)));    // $1 is taken literally, which makes valueOf and parseInt useless
        String processedInput = input.replaceAll("[uU]([0123456789abcdefABCDEF]{4})", "\\\\u$1");    // Cannot write a backslash-u escape followed by $1
        String newInput = new String(processedInput.getBytes(), "UTF-8");
        return newInput;
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    return input;
}

Yikes. Proof of concept using the possible duplicate link provided by @AlastairMcCormack in the comments:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String input = "this is the string u0075u0031u0032u0033u0034 with missing slash before unicode characters";
        System.out.println("Original input: " + input);
        // "u" or "U" followed by exactly four hex digits
        Pattern pattern = Pattern.compile("[uU][0-9a-fA-F]{4}");
        Matcher matcher = pattern.matcher(input);
        StringBuilder builder = new StringBuilder();
        int lastIndex = 0;
        while (matcher.find()) {
            // Drop the leading "u" and keep the four hex digits
            String codePoint = matcher.group().substring(1);
            System.out.println("Found code point: " + codePoint);
            char charSymbol = (char) Integer.parseInt(codePoint, 16);
            // Copy the text before the match, then the decoded character
            builder.append(input, lastIndex, matcher.start()).append(charSymbol);
            lastIndex = matcher.end();
        }
        // Copy whatever follows the last match
        builder.append(input.substring(lastIndex));
        System.out.println("Modded input: " + builder.toString());
    }
}

Yields:

Original input: this is the string u0075u0031u0032u0033u0034 with missing slash before unicode characters
Found code point: 0075
Found code point: 0031
Found code point: 0032
Found code point: 0033
Found code point: 0034
Modded input: this is the string u1234 with missing slash before unicode characters

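This makes sense: at run time the escape is just a sequence of ordinary characters in the String ("u" followed by four hex digits), so no simple regex substitution can turn it into a single character. The hex digits have to be parsed into a char value explicitly. It's not pretty, so I'd be happy too if someone had another way.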
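One possible shorter variant, offered only as a sketch: on Java 9 or later, Matcher.replaceAll accepts a function, so the find/append loop can be folded into a single call. The class name UnicodeUnescape and the method name restore are made up for this example and do not come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // "u" or "U" followed by exactly four hex digits, e.g. "u00c5"; group 1 holds the digits.
    private static final Pattern BARE_ESCAPE = Pattern.compile("[uU]([0-9a-fA-F]{4})");

    // Hypothetical helper: decodes every bare escape into the char it names.
    static String restore(String input) {
        // Matcher.replaceAll(Function) requires Java 9+.
        return BARE_ESCAPE.matcher(input).replaceAll(match -> {
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // quoteReplacement stops a decoded '$' or backslash from being
            // interpreted as replacement syntax.
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        System.out.println(restore("this is the string u00c5 with missing slash before unicode characters"));
    }
}
```

Like the loop above, this decodes any "u" that happens to be followed by four hex-looking characters, so ordinary text can be mangled if it contains such a run. It is only safe when the input is known to contain escapes in those positions.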

