Extracting Double Byte Characters/substring from a UTF-8 formatted String

Question

I'm trying to extract emojis and other special characters from Strings for further processing (e.g. a String contains '😅' as one of its characters).

But neither string.charAt(i) nor string.substring(i, i+1) works for me. The original String is formatted in UTF-8, which means that the escaped form of the above emoji is encoded as '\uD83D\uDE05'. As a result, I receive '?' (\uD83D) and '?' (\uDE05) for this position, so the emoji occupies two positions when iterating over the String.
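
The behavior described above can be reproduced with a small sketch. It relies on the fact that a Java String holds UTF-16 code units (chars) in memory, whatever encoding the text was read from, so a code point above U+FFFF such as this emoji is stored as a surrogate pair. The class and method names (SurrogateDemo, kindOf) are made up for illustration.

```java
class SurrogateDemo {
    // Classifies a single UTF-16 code unit (a Java char).
    static String kindOf(char c) {
        if (Character.isHighSurrogate(c)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate";
        }
        return "ordinary char";
    }

    public static void main(String[] args) {
        // 'a', the emoji U+1F605 written as its surrogate pair, then 'b'
        String s = "a\uD83D\uDE05b";

        // length() counts chars, codePointCount() counts code points
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));

        // charAt(i) returns each half of the surrogate pair separately
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.println(i + " : " + Integer.toHexString(c) + " (" + kindOf(c) + ")");
        }
    }
}
```

Neither half of the pair is a printable character on its own, which is why charAt(i) and substring(i, i+1) show '?' at both positions.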

Does anyone have a solution to this problem?

Answer 1

Thanks to John Kugelman for the help. The solution looks like this now:

for (int codePoint : codePoints(string)) {
    // Convert the code point back to UTF-16: one char, or two for a surrogate pair
    char[] chars = Character.toChars(codePoint);
    System.out.println(codePoint + " : " + String.copyValueOf(chars));
}

The codePoints(String string) method looks like this:

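// These imports go at the top of the file
import java.util.Iterator;
import java.util.NoSuchElementException;

// Iterates over a String one Unicode code point at a time, so a surrogate
// pair (such as an emoji) comes back as a single int rather than two chars.
private static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        @Override
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                int nextIndex = 0; // index of the next unread char (UTF-16 code unit)

                @Override
                public boolean hasNext() {
                    return nextIndex < string.length();
                }

                @Override
                public Integer next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    int result = string.codePointAt(nextIndex);
                    // Advance by 1 char for a BMP code point, 2 for a surrogate pair
                    nextIndex += Character.charCount(result);
                    return result;
                }

                @Override
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}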
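
A shorter variant is possible if Java 8 or later is available (an assumption, since the answer above does not say which version it targets): String.codePoints() returns an IntStream of code points, so no hand-written iterator is needed. This is only a sketch, and the names CodePointSplit and splitByCodePoint are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointSplit {
    // Splits a string into one String per Unicode code point,
    // so a surrogate pair such as an emoji stays together.
    static List<String> splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'a', the emoji U+1F605 written as its surrogate pair, then 'b'
        String s = "a\uD83D\uDE05b";
        for (String part : splitByCodePoint(s)) {
            System.out.println(part.codePointAt(0) + " : " + part);
        }
    }
}
```

Each element of the returned list is a complete character, so the emoji can be passed on for further processing as a single String.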

Answer 1: accepted, 2015-06-15 06:24:24
