
How to extract emoji and alphabet characters from the string

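I want to extract emoji and alphabet characters from a string into a collection. The string can contain any kind of emoji (activity, family, flag, animal symbols) as well as alphabet characters. When I get the string from an EditText, it looks like "AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭". I tried, but unfortunately the resulting collection is not what I expect. Can anyone suggest what I need to do to get the expected collection?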

I tried this code in Eclipse; please correct me if I have written something wrong.

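import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CodePoints {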

    public static void main(String []args){
        List<String> list = new ArrayList<>();
        for(int codePoint : codePoints("AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭")) {
            list.add(String.valueOf(Character.toChars(codePoint)));
        }

        System.out.println(Arrays.toString(list.toArray()));
    }

    // Exposes the code points of a string (not its UTF-16 chars) as an Iterable.
    public static Iterable<Integer> codePoints(final String string) {
        return new Iterable<Integer>() {
            public Iterator<Integer> iterator() {
                return new Iterator<Integer>() {
                    int nextIndex = 0;
                    public boolean hasNext() {
                        return nextIndex < string.length();
                    }
                    public Integer next() {
                        int result = string.codePointAt(nextIndex);
                        // Advance by 1 or 2 chars depending on the code point.
                        nextIndex += Character.charCount(result);
                        return result;
                    }
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
}

Output:
[A,B,😄,C,😊,D,👨,‍,👩,‍,👧,‍,👦,E,🏳,️,‍,🌈,‍,👭]

Expected:
[A,B,😄,C,😊,D,👨‍👩‍👧‍👦,E,🏳️‍🌈‍,👭]

The problem is that your string contains invisible characters.
They are:
Unicode Character 'ZERO WIDTH JOINER' (U+200D)
Unicode Character 'VARIATION SELECTOR-16' (U+FE0F)
Other similar ones are:
Unicode Character 'SOFT HYPHEN' (U+00AD)
...
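To make these invisible characters visible, it can help to print every code point in U+XXXX notation. The following is only a minimal sketch: the class name CodePointDump and the method describe are made up for illustration, and the sample string is the rainbow flag from the question written with escapes.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Formats every code point of the input as U+XXXX so invisible ones show up.
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 2 chars for supplementary code points
        }
        return out;
    }

    public static void main(String[] args) {
        // White flag, VARIATION SELECTOR-16, ZERO WIDTH JOINER, rainbow
        String flag = "\uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08";
        System.out.println(describe(flag)); // [U+1F3F3, U+FE0F, U+200D, U+1F308]
    }
}
```

What renders as a single flag is therefore four code points, two of which have no glyph of their own.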

Java strings are UTF-16 encoded, see: https://en.wikipedia.org/wiki/UTF-16
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
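The practical consequence of the quoted paragraph is that length() counts char code units, not characters. A small sketch of the difference (the class and method names are hypothetical, chosen only for this illustration):

```java
class Utf16Lengths {
    // Number of UTF-16 code units, i.e. what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE04"; // "A" followed by U+1F604 as a surrogate pair
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```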

Here is one way to iterate over the individual Unicode characters (code points) in a string.

public static List<String> getUnicodeCharacters(String str) {
    List<String> result = new ArrayList<>();
    char[] charArray = str.toCharArray();
    for (int i = 0; i < charArray.length; ) {
        if (Character.isHighSurrogate(charArray[i])
                && (i + 1) < charArray.length
                && Character.isLowSurrogate(charArray[i + 1])) {
            result.add(new String(new char[]{charArray[i], charArray[i + 1]}));
            i += 2;
        } else {
            result.add(new String(new char[]{charArray[i]}));
            i++;
        }
    }
    return result;
}

@Test
void getUnicodeCharacters() {
    String str = "AB😄C😊D👨‍👩‍👧‍👦E🏳️‍🌈‍👭";
    System.out.println(str.codePointCount(0, str.length()));
    for (String unicodeCharacter : UTF_16.getUnicodeCharacters(str)) {
        if ("\u200D".equals(unicodeCharacter)
                || "\uFE0F".equals(unicodeCharacter))
            continue;
        System.out.println(unicodeCharacter);
    }
}
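Skipping U+200D and U+FE0F as above prints each person or symbol separately; it does not rebuild the combined emoji the question expects. One possible way to get closer is to keep those two characters and use them as glue, as in the sketch below. This is not part of the original answer and it is a simplification, not full Unicode grapheme segmentation: it only handles ZWJ and VARIATION SELECTOR-16, and the name EmojiClusters is made up. java.text.BreakIterator.getCharacterInstance() is the more general tool, though how well it handles emoji sequences may depend on the JDK version.

```java
import java.util.ArrayList;
import java.util.List;

class EmojiClusters {
    private static final int ZWJ = 0x200D;   // ZERO WIDTH JOINER
    private static final int VS16 = 0xFE0F;  // VARIATION SELECTOR-16

    // Splits a string into code points, but keeps ZWJ/VS16 and whatever
    // directly follows a ZWJ attached to the preceding element.
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean joinNext = false; // true right after a ZWJ
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            boolean glue = (cp == ZWJ || cp == VS16);
            // Start a new element unless this code point sticks to the previous one.
            if (current.length() > 0 && !glue && !joinNext) {
                out.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            joinNext = (cp == ZWJ);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // "A", U+1F604, then man + ZWJ + woman, then "E"
        String sample = "A\uD83D\uDE04\uD83D\uDC68\u200D\uD83D\uDC69E";
        System.out.println(split(sample).size()); // 4
    }
}
```

Note that in the question's sample string a ZWJ also sits between the rainbow and the final two-women emoji (it shows up as its own element in the posted output), so this sketch would keep that last emoji attached to the flag rather than splitting it off as in the expected output.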
