
How to extract emoji and alphabet characters from the string

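I want to extract the emoji and alphabet characters from a string into a collection. The string can contain any kind of emoji (activity, family, flag, animal symbols) as well as alphabet characters. The string I get from an EditText looks like "AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭". I tried, but unfortunately the resulting collection is not what I expected. Can anyone suggest what I need to do to get the expected collection?

I tried this code in Eclipse; please correct me if I have written something wrong.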

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CodePoints {

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        // Each code point becomes its own list element, including the invisible ones.
        for (int codePoint : codePoints("AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭")) {
            list.add(String.valueOf(Character.toChars(codePoint)));
        }

        System.out.println(Arrays.toString(list.toArray()));
    }

    // Iterates over the code points of the string rather than its UTF-16 chars.
    public static Iterable<Integer> codePoints(final String string) {
        return new Iterable<Integer>() {
            public Iterator<Integer> iterator() {
                return new Iterator<Integer>() {
                    int nextIndex = 0;

                    public boolean hasNext() {
                        return nextIndex < string.length();
                    }

                    public Integer next() {
                        int result = string.codePointAt(nextIndex);
                        // Advance by 1 or 2 chars depending on the code point.
                        nextIndex += Character.charCount(result);
                        return result;
                    }

                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
}

Output:
[A,B,😄,C,😊,D,👨,‍,👩,‍,👧,‍,👦,E,🏳,️,‍,🌈,‍,👭]

Expected:
[A,B,😄,C,😊,D,👨‍👩‍👧‍👦,E,🏳️‍🌈‍,👭]

The problem is that your string contains invisible characters.
They are:
Unicode Character 'ZERO WIDTH JOINER' (U+200D)
Unicode Character 'VARIATION SELECTOR-16' (U+FE0F)
Others like them include:
Unicode Character 'SOFT HYPHEN' (U+00AD)
...
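
One way to see these invisible characters is to print every code point in U+XXXX notation. The following is only a minimal sketch; the class name ShowCodePoints and the method describe are made up for illustration and are not part of the question or the answer.

```java
public class ShowCodePoints {

    // Formats every code point of the input as U+XXXX, separated by spaces.
    // (Hypothetical helper, not from the original post.)
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // White flag, VARIATION SELECTOR-16, ZERO WIDTH JOINER, rainbow
        System.out.println(describe("\uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08"));
        // prints: U+1F3F3 U+FE0F U+200D U+1F308
    }
}
```

The rainbow flag that looks like a single symbol is really four code points, two of which have no visible glyph of their own.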

Java chars are UTF-16 encoded, see: https://en.wikipedia.org/wiki/UTF-16
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
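
As a small illustration of the quoted paragraph, the sketch below compares the number of char code units with the number of code points for a single emoji. The class name SurrogateDemo and the helper codePointLength are hypothetical names chosen for this example.

```java
public class SurrogateDemo {

    // Number of Unicode code points in the string (hypothetical helper).
    public static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE04"; // U+1F604, stored as a surrogate pair

        System.out.println(smile.length());           // 2 (char code units)
        System.out.println(codePointLength(smile));   // 1 (code point)
        System.out.println(Character.isHighSurrogate(smile.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(smile.charAt(1)));  // true
    }
}
```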

Here is one way to iterate over the individual Unicode characters (code points) in a string.

public static List<String> getUnicodeCharacters(String str) {
    List<String> result = new ArrayList<>();
    char[] charArray = str.toCharArray();
    for (int i = 0; i < charArray.length; ) {
        // A valid high/low surrogate pair forms one supplementary character.
        if (Character.isHighSurrogate(charArray[i])
                && (i + 1) < charArray.length
                && Character.isLowSurrogate(charArray[i + 1])) {
            result.add(new String(new char[]{charArray[i], charArray[i + 1]}));
            i += 2;
        } else {
            // BMP character (or an unpaired surrogate): one char on its own.
            result.add(new String(new char[]{charArray[i]}));
            i++;
        }
    }
    return result;
}

@Test
void getUnicodeCharacters() {
    String str = "AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭";
    System.out.println(str.codePointCount(0, str.length()));
    for (String unicodeCharacter : UTF_16.getUnicodeCharacters(str)) {
        if ("\u200D".equals(unicodeCharacter)
                || "\uFE0F".equals(unicodeCharacter))
            continue;
        System.out.println(unicodeCharacter);
    }
}
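
The test above drops the joiner and the variation selector, so a family emoji still comes out as separate people. If the goal is to keep such sequences together, as in the expected output of the question, one possible approach is to glue code points across U+200D and attach U+FE0F to the preceding symbol. The following is a simplified sketch under that assumption, not a full implementation of Unicode grapheme cluster rules; the class name EmojiClusters and the method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EmojiClusters {

    private static final int ZWJ = 0x200D;   // ZERO WIDTH JOINER
    private static final int VS16 = 0xFE0F;  // VARIATION SELECTOR-16

    // Groups code points so that ZWJ sequences stay in one element.
    // Simplified heuristic: only U+200D and U+FE0F are treated specially.
    public static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean joinNext = false; // true right after a ZWJ
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean attaches = (cp == ZWJ || cp == VS16);
            // Start a new cluster unless this code point attaches to the
            // previous one or the previous code point was a joiner.
            if (current.length() > 0 && !attaches && !joinNext) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            joinNext = (cp == ZWJ);
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes so the source
        // file encoding does not matter.
        String str = "AB\uD83D\uDE04C\uD83D\uDE0AD"
                + "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66"
                + "E\uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08\u200D\uD83D\uDC6D";
        System.out.println(split(str));
    }
}
```

Because the sample string has a ZERO WIDTH JOINER between the rainbow flag and 👭, this sketch returns them as one element rather than the two elements shown in the expected output. Splitting them would require knowing which joiner sequences are valid emoji, which this heuristic does not attempt.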
