How to print string with unicode characters missing backslashes?

Question

I have a string as follows:

this is the string u00c5 with missing slash before unicode characters

It has unicode character codes, but all the backslashes before the "u" are missing. How can I print this string correctly?

What I have tried

I tried to add a backslash before each incomplete unicode escape using the following code. However, "\u$1" is not allowed as the replacement in replaceAll.

public String sanitizeUnicodeQuirk(String input) {
    try {
        // String processedInput = input.replaceAll("[uU]([0123456789abcdefABCDEF]{4})", String.valueOf(Integer.parseInt("$1", 16)));    // $1 is taken literally, which makes valueOf and parseInt useless
        String processedInput = input.replaceAll("[uU]([0123456789abcdefABCDEF]{4})", "\\\\u$1");    // a single backslash followed by u$1 is rejected by the compiler as an illegal unicode escape
        String newInput = new String(processedInput.getBytes(), "UTF-8");
        return newInput;
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    return input;
}

Answer 1

Yikes. Proof of concept using the possible duplicate link provided by @AlastairMcCormack in the comments:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String input = "this is the string u0075u0031u0032u0033u0034 with missing slash before unicode characters";
        System.out.println("Original input: " + input);
        // Match a 'u' or 'U' followed by exactly four hex digits
        Pattern pattern = Pattern.compile("[uU][0-9a-fA-F]{4}");
        Matcher matcher = pattern.matcher(input);
        StringBuilder builder = new StringBuilder();
        int lastIndex = 0;
        while (matcher.find()) {
            // Drop the leading 'u' to get the four hex digits
            String codePoint = matcher.group().substring(1);
            System.out.println("Found code point: " + codePoint);
            char charSymbol = (char) Integer.parseInt(codePoint, 16);
            // Copy the text before the match, then the decoded character
            builder.append(input, lastIndex, matcher.start()).append(charSymbol);
            lastIndex = matcher.end();
        }
        // Copy whatever follows the last match
        builder.append(input.substring(lastIndex));
        System.out.println("Modded input: " + builder.toString());
    }
}

Yields:

Original input: this is the string u0075u0031u0032u0033u0034 with missing slash before unicode characters
Found code point: 0075
Found code point: 0031
Found code point: 0032
Found code point: 0033
Found code point: 0034
Modded input: this is the string u1234 with missing slash before unicode characters

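It does make sense: a unicode escape is only interpreted by the compiler when it appears in source code, so at runtime the code point is just a String of ordinary characters, and no amount of simple scrubbing with a regex replacement string is going to turn it back into the character. The hex digits have to be parsed and converted by hand. It's not pretty, so I'd be pretty happy too if someone had another way.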
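
One possibly shorter way, assuming Java 9 or later is available: Matcher.replaceAll also accepts a function that computes each replacement, so the parsing can happen per match without the manual index bookkeeping. This is only a sketch. The class name UnicodeRepair and the helper decodeBareEscapes are made up for illustration and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeRepair {
    // A 'u' or 'U' followed by exactly four hex digits; group 1 holds the digits
    private static final Pattern BARE_ESCAPE = Pattern.compile("[uU]([0-9a-fA-F]{4})");

    // Hypothetical helper: decodes every bare escape into the character it names
    public static String decodeBareEscapes(String input) {
        return BARE_ESCAPE.matcher(input).replaceAll(match -> {
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // quoteReplacement keeps a decoded '$' or backslash from being
            // treated as replacement syntax
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        System.out.println(decodeBareEscapes("this is the string u00c5 with missing slash before unicode characters"));
    }
}
```

Like the loop above, this cannot tell a damaged escape from ordinary text that happens to contain a "u" followed by four hex digits, so it is probably only safe on input where such false matches are not expected.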

Answer 1: 0 ACCEPTED 2017-01-27 13:08:56
