
How can I replace every emoji in a string with its Unicode value in Java?

I have a string like this:

"\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"" ...

and I want to replace every emoji symbol with its Unicode value, like so:

"\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"" ...

After searching a lot on the web, I found a way to "translate" one symbol to its Unicode value with this code:

String s = "👺";
int emoji = Character.codePointAt(s, 0); 
String unumber = "U+" + Integer.toHexString(emoji).toUpperCase();

But how can I change my code to convert all the emoji in a string?

P.S. The result can be in either \\Uxxxxx or U+xxxxx format.

Try this solution:

String s = "your string with emoji";

StringBuilder sb = new StringBuilder();

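for (int i = 0; i < s.length(); i++) {
  // Treat only a well-formed high + low surrogate pair as one supplementary character
  if (Character.isHighSurrogate(s.charAt(i)) && i + 1 < s.length()
      && Character.isLowSurrogate(s.charAt(i + 1))) {
    int codePoint = Character.codePointAt(s, i);
    i++; // skip the low surrogate
    sb.append("U+").append(Integer.toHexString(codePoint).toUpperCase());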
  } else {
    sb.append(s.charAt(i));
  }
}

//result
System.out.println(sb.toString());

Emoji are scattered among different Unicode blocks. For example, 👺 (0x1F47A) and 💖 (0x1F496) are from Miscellaneous Symbols and Pictographs, while 😁 (0x1F601) is from Emoticons.

If you want to filter out symbols, you need to decide which Unicode blocks (or code point ranges) you want to use. For example:

    String s = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
    StringBuilder sb = new StringBuilder();
    for (int i = 0, l = s.length() ; i < l ; i++) {
      char ch = s.charAt(i);
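      // Guard against a dangling high surrogate at the end of the string
      if (Character.isHighSurrogate(ch) && i + 1 < l && Character.isLowSurrogate(s.charAt(i + 1))) {
        i++;
        char ch2 = s.charAt(i); // Load low surrogate
        int codePoint = Character.toCodePoint(ch, ch2);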
        if ((codePoint >= 0x1F300) && (codePoint <= 0x1F64F)) { // Miscellaneous Symbols and Pictographs + Emoticons
          sb.append("U+").append(Integer.toHexString(codePoint).toUpperCase());
        } else { // otherwise just add characters as is
          sb.append(ch);
          sb.append(ch2);
        }
      } else { // if not a surrogate, just add the character
        sb.append(ch);
      }
    }
    String result = sb.toString();
    System.out.println(result); // "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."

To match only emoji, you can narrow the condition using, for example, this list.
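One way to narrow (or widen) the condition is to keep an explicit table of code point ranges. The sketch below is only illustrative: the class name `EmojiRanges` and the selection of blocks are assumptions, the table is not a complete list of emoji, and not every code point inside these blocks is an emoji. The test string is written with escapes so that the file compiles regardless of source encoding.

```java
public class EmojiRanges {

    // Illustrative, incomplete table of Unicode block ranges that contain many emoji.
    private static final int[][] RANGES = {
        {0x2600, 0x26FF},   // Miscellaneous Symbols
        {0x2700, 0x27BF},   // Dingbats
        {0x1F300, 0x1F5FF}, // Miscellaneous Symbols and Pictographs
        {0x1F600, 0x1F64F}, // Emoticons
        {0x1F680, 0x1F6FF}, // Transport and Map Symbols
        {0x1F900, 0x1F9FF}, // Supplemental Symbols and Pictographs
    };

    // Returns true if the code point falls inside one of the ranges in the table.
    static boolean inRanges(int codePoint) {
        for (int[] r : RANGES) {
            if (codePoint >= r[0] && codePoint <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Replaces every code point found in the table with its U+XXXX notation.
    static String replace(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (inRanges(cp)) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escapes encode U+2600 (sun), U+1F680 (rocket) and U+1F601 (grinning face).
        System.out.println(replace("\u2600 sunny \uD83D\uDE80 launch \uD83D\uDE01"));
    }
}
```

Iterating with `String.codePoints()` avoids handling surrogate pairs by hand, and new ranges can be added to the table without touching the loop.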

But if you want to escape every character that is stored as a surrogate pair, you can remove the codePoint check from the code.
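Dropping the range check amounts to escaping every code point outside the Basic Multilingual Plane. A minimal sketch of that variant follows; the class name `EscapeSupplementary` is hypothetical, and the code walks code points rather than chars, which is a substitution for the manual surrogate handling above rather than the answer's exact code.

```java
public class EscapeSupplementary {

    // Replaces every supplementary code point (above U+FFFF) with its U+XXXX notation
    // and copies everything else unchanged.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escapes encode U+1F47A and U+1F601 as surrogate pairs.
        System.out.println(escape("\uD83D\uDC7ATEST title value \uD83D\uDE01"));
    }
}
```

Note that this also escapes supplementary characters that are not emoji, such as rare CJK ideographs, so it is only appropriate when that is acceptable.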

In your code you don't need to specify any code point ranges, nor do you need to worry about surrogates. Instead, just specify the Unicode blocks for which you want characters to be presented as Unicode escapes. This is achieved by using the field declarations in the Character.UnicodeBlock class. For example, to determine whether 😁 (0x1F601) is an emoticon:

boolean emoticon = Character.UnicodeBlock.EMOTICONS.equals(Character.UnicodeBlock.of("😁".codePointAt(0)));
System.out.println("Is 😁 an emoticon? " + emoticon); // Prints true.
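The same lookup can be used to check which block each symbol from the question belongs to before deciding which blocks to convert. This is a small sketch with a hypothetical class name, `BlockOfEachCodePoint`; it relies only on `Character.UnicodeBlock.of(int)`, which returns null for a code point that is not a member of any defined block.

```java
public class BlockOfEachCodePoint {

    // Returns the Unicode block containing the code point, or null if there is none.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // The three emoji code points used in the question's test string.
        int[] samples = {0x1F47A, 0x1F601, 0x1F496};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase() + " -> " + blockOf(cp));
        }
    }
}
```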

Here's general purpose code. It will process any String, presenting individual characters as their Unicode equivalents if they are defined within the specified Unicode blocks:

package symbolstounicode;

import java.util.List;
import java.util.stream.Collectors;

public class SymbolsToUnicode {

    public static void main(String[] args) {

        Character.UnicodeBlock[] blocksToConvert = new Character.UnicodeBlock[]{
            Character.UnicodeBlock.EMOTICONS, 
            Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS};
        String input = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
        String output = SymbolsToUnicode.toUnicode(input, blocksToConvert);

        System.out.println("String to convert: " + input);
        System.out.println("Converted string: " + output);
        assert ("\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"".equals(output));
    }

    // Converts characters in the supplied string found in the specified list of UnicodeBlocks to their Unicode equivalents.
    static String toUnicode(String s, final Character.UnicodeBlock[] blocks) {

        StringBuilder sb = new StringBuilder("");
        List<Integer> cpList = s.codePoints().boxed().collect(Collectors.toList());

        // Note: Character.toString(int) requires Java 11 or later.
        cpList.forEach(cp -> sb.append(SymbolsToUnicode.inCodeBlock(cp, blocks) ? 
                "U+" + Integer.toHexString(cp).toUpperCase() : Character.toString(cp)));
        return sb.toString();
    }

    // Returns true if the supplied code point is within one of the specified UnicodeBlocks.
    static boolean inCodeBlock(final int cp, final Character.UnicodeBlock[] blocksToConvert) {

        for (Character.UnicodeBlock b : blocksToConvert) {
            if (b.equals(Character.UnicodeBlock.of(cp))) {
                return true;
            }
        }
        return false;
    }
}

And here's the output, using the test data in the OP:

run:
String to convert: "title":"👺TEST title value 😁","text":"💖 TEST text value."
Converted string: "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

  • I used the font Segoe UI Symbol for the code and the output window to render the symbols properly.
  • The basic idea in the code is:
    • First, specify the String to be converted, and the Unicode blocks whose characters should be converted to Unicode notation.
    • Next, convert the String into a sequence of code points using String.codePoints(), and store them in a List.
    • Finally, for each code point, determine whether it belongs to any of the specified Unicode blocks, and convert it if so.
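Since the question accepts either the U+xxxxx or the \Uxxxxx format, the steps above can also be sketched with the prefix passed in as a parameter. The class name `EmojiEscaper`, the `prefix` parameter, and the varargs signature are assumptions made for illustration; the sketch uses `StringBuilder.appendCodePoint` instead of an intermediate List, so it also runs on Java 8.

```java
public class EmojiEscaper {

    // Converts code points that belong to one of the given blocks to prefix + hex value.
    static String escape(String s, String prefix, Character.UnicodeBlock... blocks) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (inAnyBlock(cp, blocks)) {
                sb.append(prefix).append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    // Returns true if the code point belongs to one of the given blocks.
    static boolean inAnyBlock(int cp, Character.UnicodeBlock... blocks) {
        Character.UnicodeBlock actual = Character.UnicodeBlock.of(cp); // null if not in any block
        for (Character.UnicodeBlock b : blocks) {
            if (b.equals(actual)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The escapes encode U+1F496 and U+1F601 as surrogate pairs.
        String input = "\uD83D\uDC96 TEST text value \uD83D\uDE01";
        System.out.println(escape(input, "U+",
                Character.UnicodeBlock.EMOTICONS,
                Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS));
        System.out.println(escape(input, "\\U",
                Character.UnicodeBlock.EMOTICONS,
                Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS));
    }
}
```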

