How do I convert unicode codepoints to their character representation?

Question

How do I convert strings representing code points to the appropriate character?

For example, I want to have a function which gets U+00E4 and returns ä .

I know that in the character class I have a function toChars(int codePoint) which takes an integer but there is no function which takes a string of this type.

Is there a built in function or do I have to do some transformation on the string to get the integer which I can send to the function?

Answer 1

Code points are written as hexadecimal numbers prefixed by U+

So,you can do this

int codepoint=Integer.parseInt(yourString.substring(2),16);
char[] ch=Character.toChars(codepoint);

Answer 2

Call this constructor on String .

"\u00E4"

new String(new int[] { 0x00E4 }, 0, 1);

Answer 3

Converted from Kotlin:

    public String codepointToString(int cp) {
        StringBuilder sb = new StringBuilder();
        if (Character.isBmpCodePoint(cp)) {
            sb.append((char) cp);
        } else if (Character.isValidCodePoint(cp)) {
            sb.append(Character.highSurrogate(cp));
            sb.append(Character.lowSurrogate(cp));
        } else {
            sb.append('?');
        }
        return sb.toString();
    }

Answer 4

The question asked for a function to convert a string value representing a Unicode code point (ie "+Unnnn" rather than the Java formats of "\\unnnn" or "0xnnnn ). However, newer releases of Java have enhancements which simplify the processing of a string contain multiple code points in Unicode format:

The introduction of Streams in Java 8.
Method public static String toString(int codePoint) which was added to the Character class in Java 11. It returns a String rather than a char[] , so Character.toString(0x00E4) returns "ä" .

Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points in Unicode format to a readable String in a single statement:

void processUnicode() {

    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";

    String text = Arrays.stream(data.split("\\+U"))
            .filter(s -> ! s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) { 
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\"}");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11 )
            .collect(Collectors.joining());

    System.out.println(text); // Prints "Hello World 😁"
}

And this is the output:

run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz"}
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

With this approach there is no longer any need for a specific function to convert a code point in Unicode format. That's dispersed instead, through multiple intermediate operations in the Stream processing. Of course the same code could still be used to process just a single code point in Unicode format.
It's easy to add intermediate operations to perform further validation and processing on the Stream , such as case conversion, removal of emoticons, etc.

Answer 5

this example does not use char[].

// this code is Kotlin, but you can write same thing in Java
val sb = StringBuilder()
val cp :Int // codepoint
when {
    Character.isBmpCodePoint(cp) -> sb.append(cp.toChar())
    Character.isValidCodePoint(cp) -> {
        sb.append(Character.highSurrogate(cp))
        sb.append(Character.lowSurrogate(cp))
    }
    else -> sb.append('?')
}

Answer 6

You can print them

s='\u0645\u0635\u0631\u064a'
print(s)

Answer 7

The easiest way I've found so far is to just cast the codepoint; if you're just expecting a single char per codepoint, then this might be fine for you.:

int codepoint = ...;
char c = (char)codepoint;

How do I convert unicode codepoints to their character representation?

Question

7 answers

solution1
35 ACCPTED 2013-08-22 12:51:11

solution2
8 2013-08-22 12:54:34

solution3
6 2019-01-26 07:27:43

solution4
5 2019-12-22 07:13:10

solution5
2 2018-01-18 15:25:35

solution6
-4 2018-03-01 18:22:22

solution7
-6 2018-01-21 10:46:47

How do I convert unicode codepoints to their character representation?

Question

7 answers

solution1 35 ACCPTED 2013-08-22 12:51:11

solution2 8 2013-08-22 12:54:34

solution3 6 2019-01-26 07:27:43

solution4 5 2019-12-22 07:13:10

solution5 2 2018-01-18 15:25:35

solution6 -4 2018-03-01 18:22:22

solution7 -6 2018-01-21 10:46:47

solution1
35 ACCPTED 2013-08-22 12:51:11

solution2
8 2013-08-22 12:54:34

solution3
6 2019-01-26 07:27:43

solution4
5 2019-12-22 07:13:10

solution5
2 2018-01-18 15:25:35

solution6
-4 2018-03-01 18:22:22

solution7
-6 2018-01-21 10:46:47