
How to extract emoji and alphabetic characters from a string

I want to extract the emoji and alphabetic characters from a string into a collection. The string can contain any kind of emoji (activity, family, flag, animal symbols) as well as letters. The string I get from an EditText looks like "AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭". I tried the code below, but the resulting collection is not what I expect. What do I need to change to get the expected collection?

I tried this piece of code in Eclipse. Please correct me if I am wrong:

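import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CodePoints {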

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        for(int codePoint : codePoints("AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭")) {
            list.add(String.valueOf(Character.toChars(codePoint)));
        }

        System.out.println(Arrays.toString(list.toArray()));
    }

    // Iterates over the string one Unicode code point at a time.
    public static Iterable<Integer> codePoints(final String string) {
        return new Iterable<Integer>() {
            public Iterator<Integer> iterator() {
                return new Iterator<Integer>() {
                    int nextIndex = 0;
                    public boolean hasNext() {
                        return nextIndex < string.length();
                    }
                    public Integer next() {
                        int result = string.codePointAt(nextIndex);
                        // Advance by 1 or 2 chars, depending on the code point.
                        nextIndex += Character.charCount(result);
                        return result;
                    }
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
}

Output:
[A, B, 😄, C, 😊, D, 👨, ‍, 👩, ‍, 👧, ‍, 👦, E, 🏳, ️, ‍, 🌈, ‍, 👭]

Expected:
[A, B, 😄, C, 😊, D, 👨‍👩‍👧‍👦, E, 🏳️‍🌈‍, 👭]

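The problem is that your string contains invisible code points, and your loop emits each of them as a separate list element.
They are:
Unicode Character 'ZERO WIDTH JOINER' (U+200D), which joins several emoji into one displayed glyph
Unicode Character 'VARIATION SELECTOR-16' (U+FE0F), which requests emoji-style presentation for the preceding character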
Other similar ones are:
Unicode Character 'SOFT HYPHEN' (U+00AD)
...
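
To see which invisible code points a string actually contains, it can help to print every code point in U+XXXX notation. This is only a minimal sketch; the class name CodePointDump and the method describe are made up for illustration and are not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDump {

    /** Returns one "U+XXXX" label per Unicode code point in s. */
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format("U+%04X", cp));
            // A supplementary code point occupies two chars.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // White flag, VARIATION SELECTOR-16, ZERO WIDTH JOINER, rainbow,
        // written with escapes so the invisible parts are explicit.
        String flag = "\uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08";
        System.out.println(describe(flag));
    }
}
```

Any U+200D or U+FE0F entries in the result are the elements that show up as seemingly empty strings in your output.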

A Java char is a single UTF-16 code unit, see: https://en.wikipedia.org/wiki/UTF-16
The String documentation (https://docs.oracle.com/javase/7/docs/api/java/lang/String.html) says:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
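
As a small illustration of that paragraph, the sketch below compares the number of char code units with the number of code points for a single emoji. The class and method names are hypothetical; only String.length() and String.codePointCount() come from the JDK.

```java
final class SurrogateDemo {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F604 written as its surrogate pair.
        String smile = "\uD83D\uDE04";
        System.out.println(codeUnits(smile));   // two chars
        System.out.println(codePoints(smile));  // one code point
    }
}
```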

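The following method splits a string into individual Unicode code points, keeping each surrogate pair together. Note that a code point is not the same as a displayed character: an emoji sequence is still returned as several elements.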

public static List<String> getUnicodeCharacters(String str) {
    List<String> result = new ArrayList<>();
    char[] charArray = str.toCharArray();
    for (int i = 0; i < charArray.length; ) {
        if (Character.isHighSurrogate(charArray[i])
                && (i + 1) < charArray.length
                && Character.isLowSurrogate(charArray[i + 1])) {
            result.add(new String(new char[]{charArray[i], charArray[i + 1]}));
            i += 2;
        } else {
            result.add(new String(new char[]{charArray[i]}));
            i++;
        }
    }
    return result;
}

@Test
void getUnicodeCharacters() {
    String str = "AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭";
    System.out.println(str.codePointCount(0, str.length()));
    for (String unicodeCharacter : UTF_16.getUnicodeCharacters(str)) {
        if ("\u200D".equals(unicodeCharacter)
                || "\uFE0F".equals(unicodeCharacter))
            continue;
        System.out.println(unicodeCharacter);
    }
}
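
The test above only skips the joiner and the variation selector, so a family emoji is still printed as four separate people. To get closer to the expected list, the invisible code points can be kept and used to glue their neighbours together. The following is a rough sketch under simplifying assumptions, not a full implementation of Unicode grapheme cluster rules: it only handles U+200D and U+FE0F, and ignores skin tone modifiers, regional indicator flags and keycap sequences. The class name EmojiClusters and the method split are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class EmojiClusters {

    private static final int ZWJ = 0x200D;   // ZERO WIDTH JOINER
    private static final int VS16 = 0xFE0F;  // VARIATION SELECTOR-16

    /**
     * Splits s into elements, where a ZWJ or VS16 is appended to the
     * element before it and the code point after a ZWJ is appended too.
     */
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean joinNext = false; // true right after a ZWJ
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            boolean attach = cp == ZWJ || cp == VS16 || joinNext;
            if (!attach && current.length() > 0) {
                // A new element starts here, so flush the previous one.
                out.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            joinNext = (cp == ZWJ);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // "D", man ZWJ woman ZWJ girl ZWJ boy, "E", written with escapes.
        String s = "D\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67"
                + "\u200D\uD83D\uDC66E";
        System.out.println(split(s).size()); // D, the family, E
    }
}
```

One caveat for the string in the question: the printed output shows a ZWJ between 🌈 and 👭, so this sketch would merge the flag and 👭 into a single element. If they should stay separate, that stray joiner has to be removed from the input first.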
