
Convert “php unicode” to character

How can I convert so-called "PHP unicode" to a normal character in Java? Example: \\xEF\\xBC\\xA1 -> A. Is there a built-in method in the JDK, or should I use a regex for this conversion?

You first need to get the bytes out of the string into a byte-array without changing them and then decode the byte-array as a UTF-8 string.

The simplest way to get the string into a byte array is to encode it using ISO-8859-1, which maps every character with a Unicode value less than 256 to a byte with the same value (or, because Java bytes are signed, the equivalent negative value).

String phpUnicode = "\u00EF\u00BC\u00A1";
byte[] bytes = phpUnicode.getBytes("ISO-8859-1"); // maps to bytes with the same ordinal value
String javaString = new String(bytes, "UTF-8");
System.out.println(javaString);

Edit
The above converts the UTF-8 sequence to the Unicode character. If you then want to convert it to a reasonable ASCII equivalent, there's no standard way of doing that, but see this question.
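For fullwidth forms specifically, Unicode compatibility normalization (NFKC) via java.text.Normalizer happens to fold the character to its ASCII counterpart. The following is a minimal sketch of that one technique, not a general transliteration method; the class name FullwidthFolder and the method foldCompatibility are hypothetical names used only for illustration.

```java
import java.text.Normalizer;

class FullwidthFolder {
    // Applies NFKC normalization, which replaces compatibility characters
    // (such as fullwidth Latin letters) with their canonical equivalents.
    static String foldCompatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullwidthA = "\uFF21"; // FULLWIDTH LATIN CAPITAL LETTER A
        System.out.println(foldCompatibility(fullwidthA)); // prints "A"
    }
}
```

Characters without a compatibility mapping pass through unchanged, so this only helps for cases like fullwidth letters and digits.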

Edit
I assumed that you had a string containing characters that had the same ordinal value as the UTF-8 sequence but you indicate that your string literally contains the escape sequence, as in:

String phpUnicode = "\\xEF\\xBC\\xA1";

The JDK doesn't have any built-in methods to convert Strings like this, so you'll need to use your own regex. Since we ultimately want to convert a UTF-8 byte sequence into a String, we need to build up a byte array, perhaps like this:

Pattern oneChar = Pattern.compile("\\\\x([0-9A-F]{2})|(.)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = oneChar.matcher(phpUnicode);
ByteArrayOutputStream bytes = new ByteArrayOutputStream();

while (matcher.find()) {
    int ch;
    if (matcher.group(1) == null) {
        ch = matcher.group(2).charAt(0);
    }
    else {
        ch = Integer.parseInt(matcher.group(1), 16);
    }
    bytes.write((int) ch);
}
String javaString = new String(bytes.toByteArray(), "UTF-8");
System.out.println(javaString);

This generates a UTF-8 byte stream by converting the \\xAB sequences, and that stream is then decoded into a Java string. Note that any character that is not part of an escape sequence is converted to a byte equal to the low-order 8 bits of its Unicode value. This works fine for ASCII but can cause transcoding problems for non-ASCII characters.
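One way to avoid that problem for literal non-ASCII characters is to encode the text between escape sequences as UTF-8 before writing it to the buffer. The following is a minimal sketch under that assumption; the class name PhpEscapeDecoder and the method decode are hypothetical names, and it assumes the escaped bytes form valid UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhpEscapeDecoder {
    // Matches a literal backslash, 'x', and exactly two hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        while (m.find()) {
            // Text between escapes is written as its own UTF-8 bytes,
            // so non-ASCII literals are not truncated to 8 bits.
            byte[] literal = input.substring(last, m.start()).getBytes(StandardCharsets.UTF_8);
            out.write(literal, 0, literal.length);
            // The escape itself contributes one raw byte.
            out.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        byte[] tail = input.substring(last).getBytes(StandardCharsets.UTF_8);
        out.write(tail, 0, tail.length);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The literal e-acute survives, and the escapes become U+FF21.
        String decoded = decode("caf\u00E9 \\xEF\\xBC\\xA1 end");
        System.out.println(decoded.equals("caf\u00E9 \uFF21 end")); // prints "true"
    }
}
```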

@McDowell:
The sequence:

String phpUnicode = "\u00EF\u00BC\u00A1";
byte[] bytes = phpUnicode.getBytes("ISO-8859-1"); 

creates a byte array containing as many bytes as the original string has characters and for each character with a unicode value below 256, the same numeric value is stored in the byte-array.

The character FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) is not present in the original String, so the fact that it is not in ISO-8859-1 is irrelevant.

I know that transcoding bugs can occur when you convert characters to bytes; that's why I said that ISO-8859-1 would only "map every character with a unicode value less than 256 to a byte with the same value".
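The ISO-8859-1 behaviour described here can be checked directly. The sketch below is only an illustration (the class name Latin1Check and its methods are hypothetical): it confirms that every character below 256 encodes to a single byte with the same numeric value, and shows that a character outside that range is replaced rather than preserved.

```java
import java.nio.charset.StandardCharsets;

class Latin1Check {
    // Returns the unsigned value of the single byte that ISO-8859-1
    // produces for the given character.
    static int encodedValue(char c) {
        byte[] b = String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1);
        return b[0] & 0xFF; // mask to undo Java's signed byte representation
    }

    // True if every code point in 0..255 maps to a byte of the same value.
    static boolean lowCodePointsPreserved() {
        for (int cp = 0; cp < 256; cp++) {
            if (encodedValue((char) cp) != cp) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(lowCodePointsPreserved());                  // prints "true"
        System.out.println(Integer.toHexString(encodedValue('\u00EF'))); // prints "ef"
        // U+FF21 is not in ISO-8859-1, so String.getBytes substitutes '?' (0x3f).
        System.out.println(Integer.toHexString(encodedValue('\uFF21'))); // prints "3f"
    }
}
```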

The character in question is U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A). The PHP form (\\xEF\\xBC\\xA1) is a UTF-8 encoded octet sequence.
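To confirm that relationship, the sketch below goes in the opposite direction: it encodes U+FF21 as UTF-8 and prints the octets in the PHP escape form. The class name Utf8Octets and the method toPhpEscapes are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Octets {
    // Renders each UTF-8 byte of the input as a \xHH escape.
    static String toPhpEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPhpEscapes("\uFF21")); // prints "\xEF\xBC\xA1"
    }
}
```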

In order to decode this sequence to a Java String (which is always UTF-16), you would use the following code:

// \xEF\xBC\xA1
byte[] utf8 = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
String utf16 = new String(utf8, Charset.forName("UTF-8"));

// print the char as hex   
for(char ch : utf16.toCharArray()) {
    System.out.format("%02x%n", (int) ch);
}

If you want to decode the data from a string literal you could use code of this form:

public static void main(String[] args) {
  String utf16 = transformString("This is \\xEF\\xBC\\xA1 string");
  for (char ch : utf16.toCharArray()) {
    System.out.format("%s %02x%n", ch, (int) ch);
  }
}

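// Match runs of \xHH escapes; XDigit restricts HH to hex digits so that
// Integer.parseInt below cannot fail on input such as \xZZ.
private static final Pattern SEQ
                           = Pattern.compile("(\\\\x\\p{XDigit}\\p{XDigit})+");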

private static String transformString(String encoded) {
  StringBuilder decoded = new StringBuilder();
  Matcher matcher = SEQ.matcher(encoded);
  int last = 0;
  while (matcher.find()) {
    decoded.append(encoded.substring(last, matcher.start()));
    byte[] utf8 = toByteArray(encoded.substring(matcher.start(), matcher.end()));
    decoded.append(new String(utf8, Charset.forName("UTF-8")));
    last = matcher.end();
  }
  return decoded.append(encoded.substring(last, encoded.length())).toString();
}

private static byte[] toByteArray(String hexSequence) {
  byte[] utf8 = new byte[hexSequence.length() / 4];
  for (int i = 0; i < utf8.length; i++) {
    int offset = i * 4;
    String hex = hexSequence.substring(offset + 2, offset + 4);
    utf8[i] = (byte) Integer.parseInt(hex, 16);
  }
  return utf8;
}


 