How to remove all non-printable characters + Emoji from a String?

Question

I want to remove all non-printable characters and all Emoji from my String.

I tried the following, but it doesn't work properly for Emoji:

public static String removeAllNoAsciiChars(String str) {
    if (!TextUtils.isEmpty(str)) {
        str = str.replaceAll("\\p{C}", "");
    }
    return str;
}

Examples:

"L'alphabet est génial 😀!"

Final result expected: "L'alphabet est génial !"

"Ça c'est du cœur ❤️ :) !"

Final result expected: "Ça c'est du cœur :) !"

Answer 1

The \\p{C} regex takes care of all non-printable characters. Be aware that this includes tabs and newlines.
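If tabs and line breaks should survive, one option is to subtract them from the class using Java's character-class intersection. The following is only a minimal sketch in plain Java; the class and method names are hypothetical and not part of the original question.

```java
class ControlCharDemo {
    // Removes every code point in Unicode category C (control, format, unassigned, ...).
    static String stripAllC(String s) {
        return s.replaceAll("\\p{C}", "");
    }

    // Hypothetical variant: same as above, but keeps tab, line feed and carriage return
    // by intersecting category C with "anything except those three characters".
    static String stripCKeepWhitespace(String s) {
        return s.replaceAll("[\\p{C}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String input = "a\tb\nc\u0000d"; // contains a tab, a newline and a NUL character
        System.out.println(stripAllC(input).equals("abcd"));
        System.out.println(stripCKeepWhitespace(input).equals("a\tb\ncd"));
    }
}
```

Both checks in main should print true: the first call drops the tab and newline along with the NUL character, the second drops only the NUL.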

As for Emoji characters, that's a bit more complicated. You could just match the newer Emoji characters in Unicode, i.e. Unicode Block 'Emoticons' (U+1F600 to U+1F64F), but that doesn't cover all Emoji characters, e.g. ❤ 'HEAVY BLACK HEART' (U+2764) is not in that range.
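To illustrate why the Emoticons block alone is not enough, here is a small sketch that removes only the range U+1F600 to U+1F64F. The class and method names are made up for this example, and the source is kept ASCII-only by writing the characters as escapes.

```java
class EmoticonsBlockDemo {
    // Removes only code points in the 'Emoticons' block (U+1F600 to U+1F64F).
    static String stripEmoticonsBlock(String s) {
        return s.replaceAll("[\\x{1F600}-\\x{1F64F}]", "");
    }

    public static void main(String[] args) {
        String grinning = "A\uD83D\uDE00B"; // 'A', GRINNING FACE (U+1F600), 'B'
        String heart = "A\u2764B";          // 'A', HEAVY BLACK HEART (U+2764), 'B'
        // The grinning face is removed, the heart is left untouched.
        System.out.println(stripEmoticonsBlock(grinning).equals("AB"));
        System.out.println(stripEmoticonsBlock(heart).equals(heart));
    }
}
```

Both lines should print true, which shows the heart slipping through a block-based filter.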

If you look at those Emoji characters, e.g. 😀 'GRINNING FACE' (U+1F600), you'll see that it belongs to Unicode Category "Symbol, Other [So]", which consists of 5855 characters. If you're OK with removing all of those, that would definitely be the easiest solution.

Your text included a red heart (❤️), not a black heart (❤), and that is done in Unicode by adding a variation selector after the black heart, in this case 'VARIATION SELECTOR-16' (U+FE0F). There are 256 variation selectors, and they are all in category "Mark, Nonspacing [Mn]", but you probably don't want to remove all 1763 characters in that category, so you need to remove just the 2 ranges of variation selectors, i.e. U+FE00 to U+FE0F (selectors 1-16) and U+E0100 to U+E01EF (selectors 17-256).
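The categories mentioned above can be checked from Java itself. The sketch below assumes Java 8 or later (for String.codePoints()); the class and method names are illustrative only.

```java
class CategoryDemo {
    // Returns the Unicode general category (as a Character.* constant)
    // of every code point in the string.
    static int[] types(String s) {
        return s.codePoints().map(cp -> Character.getType(cp)).toArray();
    }

    public static void main(String[] args) {
        // GRINNING FACE (U+1F600) is a single code point in category So.
        int[] face = types("\uD83D\uDE00");
        System.out.println(face.length == 1 && face[0] == Character.OTHER_SYMBOL);

        // The red heart is two code points: U+2764 (So) followed by U+FE0F (Mn).
        int[] heart = types("\u2764\uFE0F");
        System.out.println(heart.length == 2
                && heart[0] == Character.OTHER_SYMBOL
                && heart[1] == Character.NON_SPACING_MARK);
    }
}
```

Both lines should print true, confirming that removing \\p{So} alone would leave the variation selector behind.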

After that, you may or may not want to reduce consecutive spaces to a single space.

str = str.replaceAll("[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]+", "")
         .replaceAll(" {2,}", " ");
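For reference, here is the same expression wrapped in a self-contained method and run against the two strings from the question. It is a sketch in plain Java (a null/empty check stands in for Android's TextUtils.isEmpty), and the class and method names are hypothetical.

```java
class EmojiStripper {
    // Removes category C, category So and both variation-selector ranges,
    // then collapses runs of spaces left behind by the removal.
    static String strip(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        return str.replaceAll("[\\p{C}\\p{So}\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}]+", "")
                  .replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        // "L'alphabet est genial <GRINNING FACE>!" with e-acute written as an escape
        String first = "L'alphabet est g\u00e9nial \uD83D\uDE00!";
        // "Ca c'est du coeur <HEAVY BLACK HEART><VS-16> :) !" with C-cedilla and oe ligature
        String second = "\u00c7a c'est du c\u0153ur \u2764\uFE0F :) !";

        System.out.println(strip(first).equals("L'alphabet est g\u00e9nial !"));
        System.out.println(strip(second).equals("\u00c7a c'est du c\u0153ur :) !"));
    }
}
```

Both checks should print true. Note that the second replaceAll also collapses double spaces that were already in the input, which may or may not be desirable.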

Solution 1
5 2018-01-02 09:50:11
