how to distinguish Unicode characters and ASCII characters

Question

I want to distinguish Unicode characters and ASCII characters from the below string:

abc\u263A\uD83D\uDE0A\uD83D\uDE22123

How can I distinguish characters? Can anyone help me with this issue? I have tried some code, but it crashes in some cases. What is wrong with my code?

The first three characters are abc , and the last three characters are 123 . The rest of the string is Unicode characters. I want to make a string array like this:

str[0] = 'a';
str[1] = 'b';
str[2] = 'c';
str[3] = '\u263A\uD83D';
str[4] = '\uDE0A\uD83D';
str[5] = '\uDE22';
str[6] = '1';
str[7] = '2';
str[8] = '3';

Code:

private String[] getCharArray(String unicodeStr) {
        ArrayList<String> list = new ArrayList<>();
        for (int i = 0; i < unicodeStr.length(); i++) {
            if (unicodeStr.charAt(i) == '\\') {
                list.add(unicodeStr.substring(i, i + 11));
                i = i + 11;
            } else {
                list.add(String.valueOf(unicodeStr.charAt(i)));
            }
        }
        return list.toArray(new String[list.size()]);
    }

Answer 1

It's not entirely clear what you're asking for, but if you want to tell if a specific character is ASCII, you can use Guava's ChatMatcher.ascii() .

if ( CharMatcher.ascii().matches('a') ) {
    System.out.println("'a' is ascii");
}
if ( CharMatcher.ascii().matches('\u263A\uD83D') ) {
    // this shouldn't be printed
    System.out.println("'\u263A\uD83D' is ascii");
}

Answer 2

ASCII characters exist in Unicode, they are Unicode codepoints U+0000 - U+007F, inclusive.

Java strings are represented in UTF-16, which is a 16-bit byte encoding of Unicode. Each Java char is a UTF-16 code unit. Unicode codepoints U+0000 - U+FFFF use 1 UTF-16 code unit and thus fit in a single char , whereas Unicode codepoints U+10000 and higher require a UTF-16 surrogate pair and thus need two char s.

If the string has UTF-16 code units represented as actual char values, then you can use Java's string methods that work with codepoints, eg:

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    while (i < unicodeStr.length()) {
        j = unicodeStr.offsetByCodePoints(i, 1);
        list.add(unicodeStr.substring(i, j));
        i = j;
    }
    return list.toArray(new String[list.size()]);
}

On the other hand, if the string has UTF-16 code units represented in an encoded "\\uXXXX" format (ie, as 6 distinct characters - '\\' , 'u' , ...), then things get a little more complicated as you have to parse the encoded sequences manually.

If you want to preserve the "\\uXXXX" strings in your array, you could do something like this:

private boolean isUnicodeEncoded(string s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j, start;
    char ch;
    while (i < unicodeStr.length()) {
        start = i;
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch)) {
                i += j;
            }
        }
        list.add(unicodeStr.substring(start, i));
    }
    return list.toArray(new String[list.size()]);
}

If you want to decode the "\\uXXXX" strings into actual chars in your array, you could do something like this instead:

private boolean isUnicodeEncoded(string s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    char ch1, ch2;
    while (i < unicodeStr.length()) {
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch1 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch1 = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch1) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch2 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch2 = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch2)) {
                list.add(String.valueOf(new char[]{ch1, ch2}));
                i += j;
                continue;
            }
        }
        list.add(String.valueOf(ch1));
    }
    return list.toArray(new String[list.size()]);
}

Or, something like this (per https://stackoverflow.com/a/24046962/65863 ):

private String[] getCharArray(String unicodeStr) {
    Properties p = new Properties();
    p.load(new StringReader("key="+unicodeStr));
    unicodeStr = p.getProperty("key");
    ArrayList<String> list = new ArrayList<>();
    int i = 0;
    while (i < unicodeStr.length()) {
        if (Character.isHighSurrogate(unicodeStr.charAt(i)) &&
            ((i+1) < unicodeStr.length()) &&
            Character.isLowSurrogate(unicodeStr.charAt(i+1)))
        {
            list.add(unicodeStr.substring(i, i+2));
            i += 2;
        }
        else {
            list.add(unicodeStr.substring(i, i+1));
            ++i;
        }
    }
    return list.toArray(new String[list.size()]);
}

how to distinguish Unicode characters and ASCII characters

Question

2 answers

solution1
0 2017-08-14 10:03:25

solution2
0 ACCPTED 2017-08-17 02:26:43

how to distinguish Unicode characters and ASCII characters

Question

2 answers

solution1 0 2017-08-14 10:03:25

solution2 0 ACCPTED 2017-08-17 02:26:43

solution1
0 2017-08-14 10:03:25

solution2
0 ACCPTED 2017-08-17 02:26:43