Displaying UTF-8 Emoji in Java

Say I have the 😈 (devil) emoji.

In 4-byte UTF-8, it's represented like so: \xF0\x9F\x98\x88

However, in Java, it will only print correctly like so: \uD83D\uDE08

How would I convert from the first to the second?

UPDATE 2

MNEMO's answer is much simpler, and answers my question, so it's probably better to go with his solution.

UPDATE

Thanks Basil Bourque for the write-up. It was very interesting.

I found a good reference here: https://github.com/pRizz/Unicode-Converter/blob/master/conversionfunctions.js (particularly the convertUTF82Char() function).

For anyone wandering by here in the future, here's what that looks like in Java:

public static String fromCharCode(int n) {
    char c = (char)n;
    return Character.toString(c);
}

public static String decToChar(int n) {
    // converts a single Unicode code point to a String
    // code points above 0xFFFF are encoded as a UTF-16 surrogate pair
    // n: int, the code point to be converted
    String result = "";
    if (n <= 0xFFFF) {
        result += fromCharCode(n);
    } else if (n <= 0x10FFFF) {
        n -= 0x10000;
        result += fromCharCode(0xD800 | (n >> 10)) + fromCharCode(0xDC00 | (n & 0x3FF));
    } else {
        result += "dec2char error: Code point out of range: " + decToHex(n);
    }

    return result;
}

public static String decToHex(int n) {
    return Integer.toHexString(n).toUpperCase();
}

public static String convertUTF8_toChar(String str) {
    // converts a sequence of space-separated hex numbers, each representing one UTF-8 byte, to characters
    // str: string, the sequence to be converted, e.g. "F0 9F 98 88"
    String outputString = "";
    int counter = 0; // continuation bytes still expected
    int n = 0;       // code point being assembled

    // remove leading and trailing whitespace
    str = str.trim();
    if (str.length() == 0) {
        return "";
    }

    // split on runs of whitespace (Java regexes take no /.../ delimiters or g flag)
    String[] listArray = str.split("\\s+");
    for (int i = 0; i < listArray.length; i++) {
        int b = Integer.parseInt(listArray[i], 16);
        switch (counter) {
            case 0:
                if (0 <= b && b <= 0x7F) { // 0xxxxxxx
                    outputString += decToChar(b);
                } else if (0xC0 <= b && b <= 0xDF) { // 110xxxxx
                    counter = 1;
                    n = b & 0x1F;
                } else if (0xE0 <= b && b <= 0xEF) { // 1110xxxx
                    counter = 2;
                    n = b & 0xF;
                } else if (0xF0 <= b && b <= 0xF7) { // 11110xxx
                    counter = 3;
                    n = b & 0x7;
                } else {
                    outputString += "convertUTF82Char: error1 " + decToHex(b) + "! ";
                }
                break;
            case 1:
                if (b < 0x80 || b > 0xBF) {
                    outputString += "convertUTF82Char: error2 " + decToHex(b) + "! ";
                }
                counter--;
                outputString += decToChar((n << 6) | (b - 0x80));
                n = 0;
                break;
            case 2:
            case 3:
                if (b < 0x80 || b > 0xBF) {
                    outputString += "convertUTF82Char: error3 " + decToHex(b) + "! ";
                }
                n = (n << 6) | (b - 0x80);
                counter--;
                break;
        }
    }

    // drop a single trailing space, if any (plain Java regex, no /.../ delimiters)
    return outputString.replaceAll(" $", "");
}

Pretty much a 1-for-1 copy, but it accomplishes my goal.
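For comparison, the same hex-to-text conversion can probably be left to the standard library, since `new String(bytes, charset)` already does the UTF-8 decoding. The following is only a sketch under that assumption; the class and method names (`HexUtf8Decoder`, `decodeHexUtf8`) are made up for illustration, and unlike the port above it does not report malformed input with error strings (the JDK substitutes the replacement character instead).

```java
import java.nio.charset.StandardCharsets;

public class HexUtf8Decoder {
    // Hypothetical helper: parses space-separated hex octets and decodes them as UTF-8.
    static String decodeHexUtf8(String hex) {
        String trimmed = hex.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] parts = trimmed.split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // parse as int first, then narrow: values 0x80..0xFF do not fit a signed byte
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeHexUtf8("F0 9F 98 88");
        System.out.println(s.codePointAt(0) == 0x1F608); // true
    }
}
```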

The SMILING FACE WITH HORNS character (😈) is assigned to code point 128,520 decimal (1F608 hexadecimal) in Unicode.
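As a quick illustration of how that code point shows up inside a Java String, here is a small sketch. The class and method names (`CodePointDemo`, `fromCodePoint`) are hypothetical; the calls themselves (`Character.toChars`, `codePointAt`, `codePointCount`) are standard JDK methods.

```java
public class CodePointDemo {
    // Builds a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F608);

        // One code point, but two UTF-16 code units (a surrogate pair).
        System.out.println(s.codePointCount(0, s.length()));        // 1
        System.out.println(s.length());                             // 2
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f608
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D83D DE08
    }
}
```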

You have a choice in how to represent that number with a series of octets.

  • UTF-8 is one way to represent that number with a variable length, using 1-4 octets.
    • UTF-8 is becoming the dominant encoding in many spheres.
    • Java source code files are usually written in UTF-8, in my experience.
  • UTF-16 is another way, also variable-length, but using either 2 octets or 4.
    • The Java language uses UTF-16 for its char and String types.
    • UTF-8 is generally preferred over UTF-16 for files and data exchange.
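To make the difference between the two encodings concrete, the sketch below encodes the same string both ways in memory and prints the octets. The class name `EncodingLengths` and the helper `toHex` are illustrative names, not part of any library; `UTF_16BE` is used here so that no byte order mark is added.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String devil = "\uD83D\uDE08"; // U+1F608 written as a surrogate pair

        System.out.println(toHex(devil.getBytes(StandardCharsets.UTF_8)));    // F0 9F 98 88
        System.out.println(toHex(devil.getBytes(StandardCharsets.UTF_16BE))); // D8 3D DE 08

        // An ASCII letter takes 1 octet in UTF-8 but 2 in UTF-16.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length);   // 2
    }
}
```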

In most text-editors, you can simply paste the single character 😈 into your source code. When written to a UTF-8 file, the editor will create the necessary series of octets.

When writing this character to a text file, or otherwise serializing to a stream of octets, you can choose to use either UTF-8 or UTF-16.

The following are a couple of trials. You can examine the resulting files with a hex editor to see the octets.

UTF-8

This code generates a file in UTF-8 encoding. We find four octets, hex values F0 9F 98 88, decimal values 240 159 152 136.

You can find this code discussed in the Oracle Java Tutorial.

Notice how we specify an encoding for our file, StandardCharsets.UTF_8.

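import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path file = Paths.get( "/Users/basilbourque/devil_utf-8.txt" );
Charset charset = StandardCharsets.UTF_8;
String s = "😈";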
try ( BufferedWriter writer = Files.newBufferedWriter( file , charset ) )
{
    writer.write( s , 0 , s.length() );
}
catch ( IOException e )
{
    e.printStackTrace();
}

UTF-16

This code generates a file in UTF-16 encoding. We find six octets: a two-octet BOM prefix (FE FF) followed by four octets for our single character. Those four octets are D8 3D DE 08 in hex, 216 61 222 8 in decimal. They are the surrogate pair D83D DE08.

Same code as above, but we switched the Charset to StandardCharsets.UTF_16.

Path file = Paths.get( "/Users/basilbourque/devil_utf-16.txt" );
Charset charset = StandardCharsets.UTF_16;
String s = "😈";
try ( BufferedWriter writer = Files.newBufferedWriter( file , charset ) )
{
    writer.write( s , 0 , s.length() );
}
catch ( IOException e )
{
    e.printStackTrace();
}
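If you would rather not open a hex editor, the two trials can be checked from Java itself. This is a rough sketch, assuming a temporary file is acceptable in place of the fixed paths above; `FileOctets` and `writeAndDump` are hypothetical names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileOctets {
    // Writes text to a temp file in the given charset, then returns the file's octets as hex.
    static String writeAndDump(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("devil", ".txt");
            try {
                try (BufferedWriter writer = Files.newBufferedWriter(file, charset)) {
                    writer.write(text);
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : Files.readAllBytes(file)) {
                    if (hex.length() > 0) {
                        hex.append(' ');
                    }
                    hex.append(String.format("%02X", b));
                }
                return hex.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String devil = "\uD83D\uDE08";
        System.out.println(writeAndDump(devil, StandardCharsets.UTF_8));  // F0 9F 98 88
        System.out.println(writeAndDump(devil, StandardCharsets.UTF_16)); // FE FF D8 3D DE 08
    }
}
```

Java's UTF_16 charset writes big-endian with a leading BOM when encoding, which is where the extra FE FF comes from.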

About Unicode and encodings

To learn the basics of Unicode and encodings, read the post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Well, this may be unnecessary to add, but once you understand character encodings and the Unicode concepts, the following code might work for you.

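// requires: import java.nio.charset.StandardCharsets;
byte[] a = { (byte)0xf0, (byte)0x9f, (byte)0x98, (byte)0x88 }; // UTF-8 octets of U+1F608
String s = new String(a, StandardCharsets.UTF_8);   // decode; the Charset overload throws no checked exception
byte[] b = s.getBytes(StandardCharsets.UTF_16BE);   // the two UTF-16 code units, big-endian, no BOM
for ( byte c : b ) { System.out.printf("%02x ",c); } // prints: d8 3d de 08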
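If what you actually want is the \uD83D\uDE08 escape text from the question, the same decoding step can feed a small formatter. This is a sketch only; `EscapePrinter` and `toJavaEscapes` are made-up names, and every UTF-16 code unit is escaped, including plain ASCII.

```java
import java.nio.charset.StandardCharsets;

public class EscapePrinter {
    // Decodes UTF-8 octets, then writes each UTF-16 code unit as a Java-style escape.
    static String toJavaEscapes(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (char unit : decoded.toCharArray()) {
            // "\\u" is a literal backslash followed by the letter u
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x88 };
        // Prints the two escapes for the surrogate pair D83D DE08.
        System.out.println(toJavaEscapes(utf8));
    }
}
```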
