简体   繁体   中英

How to convert UTF-8 to unicode in Java?

For example, in Emoji Char set, U+1F601 is the unicode value for "GRINNING FACE WITH SMILING EYES", and \\xF0\\x9F\\x98\\x81 is the UTF-8 bytes value for this character.

\\xE2\\x9D\\xA4 is for heavy black heart, and the unicode is U+2764 .

So my question is, if I have a byte array with value (0xF0, 0x9F, 0x98, 0x81, 0xE2, 0x9D, 0xA4) , then how I can convert it into Unicode value?

For the above result, what I want is a String array with value "1F601" and "2764" .

I know I can write a complex method to do this work, but I hope there is already a library to do this work.

So my question is, if I have a byte array with value (0xF0, 0x9F, 0x98, 0x81), then how I can convert it into Unicode value?

Simply call the String constructor specifying the data and the encoding:

String text = new String(bytes, "UTF-8");

You can specify a Charset instead of the name of the encoding - I like Guava 's simple Charsets class, which allows you to write:

String text = new String(bytes, Charsets.UTF_8);

Or for Java 7, use StandardCharsets without even needing Guava:

String text = new String(bytes, StandardCharsets.UTF_8);

Simply use String class:

byte[] bytesArray = new byte[10]; // array of bytes (0xF0, 0x9F, 0x98, 0x81)

String string = new String(bytesArray, Charset.forName("UTF-8")); // covert byteArray

System.out.println(string); // Test result

Here is an example using InputStreamReader:

InputStream inputStream = new FileInputStream("utf-8-text.txt");
Reader      reader      = new InputStreamReader(inputStream,
                                                Charset.forName("UTF-8"));

int data = reader.read();
while(data != -1){
    char theChar = (char) data;
    data = reader.read();
}

reader.close();

Ref: Java I18N example

Here is a function to convert UNICODE (ISO_8859_1) to UTF-8

public static String String_ISO_8859_1To_UTF_8(String strISO_8859_1) {
final StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < strISO_8859_1.length(); i++) {
  final char ch = strISO_8859_1.charAt(i);
  if (ch <= 127) 
  {
      stringBuilder.append(ch);
  }
  else 
  {
      stringBuilder.append(String.format("%02x", (int)ch));
  }
}
String s = stringBuilder.toString();
int len = s.length();
byte[] data = new byte[len / 2];
for (int i = 0; i < len; i += 2) {
    data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4)
                         + Character.digit(s.charAt(i+1), 16));
}
String strUTF_8 =new String(data, StandardCharsets.UTF_8);
return strUTF_8;
}

TEST

String strA_ISO_8859_1_i = new String("الغلاف".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

System.out.println("ISO_8859_1 strA est = "+ strA_ISO_8859_1_i + "\n String_ISO_8859_1To_UTF_8 = " + String_ISO_8859_1To_UTF_8(strA_ISO_8859_1_i));

RESULT

ISO_8859_1 strA est = اÙغÙا٠String_ISO_8859_1To_UTF_8 = الغلاف

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM