How can I replace every emoji in a string with its Unicode value in Java?

I have a string like this:

"\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"" ...

and I want to replace every emoji symbol with its Unicode value, like so:

"\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"" ...

After a lot of searching on the web, I found a way to "translate" a single symbol to its Unicode value with this code:

String s = "👺";
int emoji = Character.codePointAt(s, 0); 
String unumber = "U+" + Integer.toHexString(emoji).toUpperCase();

But how can I change my code to handle every emoji in a string?

P.S. The output can be in either \\Uxxxxx or U+xxxxx format.

Try this solution:

String s = "your string with emoji";

StringBuilder sb = new StringBuilder();

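for (int i = 0; i < s.length(); i++) {
  char c = s.charAt(i);
  // A supplementary character is stored as a high surrogate followed by a low surrogate.
  if (Character.isHighSurrogate(c) && i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
    int codePoint = Character.codePointAt(s, i);
    i++; // skip the low surrogate that was just consumed
    sb.append("U+").append(Integer.toHexString(codePoint).toUpperCase());
  } else {
    sb.append(c);
  }
}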

//result
System.out.println(sb.toString());
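
The same idea can also be written by stepping through the string one code point at a time, so the surrogate handling is left to the JDK. This is only a minimal sketch: the class name SupplementaryEscaper and the method escapeSupplementary are hypothetical names chosen for illustration, and the emoji are written as \u escapes so the source does not depend on file encoding.

```java
final class SupplementaryEscaper {

    // Hypothetical helper: replaces every code point above U+FFFF with "U+XXXXX".
    static String escapeSupplementary(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                // BMP characters (and unpaired surrogates) are copied unchanged.
                sb.appendCodePoint(cp);
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDC7A is U+1F47A and \uD83D\uDE01 is U+1F601.
        System.out.println(escapeSupplementary("\uD83D\uDC7ATEST title value \uD83D\uDE01"));
    }
}
```

Note that this escapes every supplementary character, not only emoji, which matches what the loop above does.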

Emoji are scattered among different Unicode blocks. For example, 👺 (0x1F47A) and 💖 (0x1F496) are from Miscellaneous Symbols and Pictographs, while 😁 (0x1F601) is from Emoticons.
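
If you want to check which block a given symbol belongs to, Character.UnicodeBlock.of can tell you. A small sketch, assuming Java 7 or later (where these two blocks are defined); the class BlockLookup and the method blockOf are hypothetical names used only for this illustration:

```java
import java.lang.Character.UnicodeBlock;

final class BlockLookup {

    // Hypothetical helper: returns the Unicode block of the first code point in s.
    static UnicodeBlock blockOf(String s) {
        return UnicodeBlock.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // U+1F47A, U+1F496 and U+1F601, written as surrogate-pair escapes.
        String[] samples = {"\uD83D\uDC7A", "\uD83D\uDC96", "\uD83D\uDE01"};
        for (String sample : samples) {
            int cp = sample.codePointAt(0);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase() + " -> " + blockOf(sample));
        }
    }
}
```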

If you want to convert only certain symbols, you need to decide which Unicode blocks (or code point ranges) to match. For example:

    String s = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
    StringBuilder sb = new StringBuilder();
    for (int i = 0, l = s.length() ; i < l ; i++) {
      char ch = s.charAt(i);
      if (Character.isHighSurrogate(ch) && i + 1 < l && Character.isLowSurrogate(s.charAt(i + 1))) { // only a complete surrogate pair
        i++;
        char ch2 = s.charAt(i); // Load low surrogate
        int codePoint = Character.toCodePoint(ch, ch2);
        if ((codePoint >= 0x1F300) && (codePoint <= 0x1F64F)) { // Miscellaneous Symbols and Pictographs + Emoticons
          sb.append("U+").append(Integer.toHexString(codePoint).toUpperCase());
        } else { // otherwise just add characters as is
          sb.append(ch);
          sb.append(ch2);
        }
      } else { // if not a surrogate, just add the character
        sb.append(ch);
      }
    }
    String result = sb.toString();
    System.out.println(result); // "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."

To get only emojis you can narrow the condition using, for example, this list

But if you want to escape every surrogate pair, you can drop the codePoint range check from the code.

In your code you don't need to specify any code point ranges, nor do you need to worry about surrogates. Instead, just specify the Unicode blocks for which you want characters to be presented as Unicode escapes. This is achieved by using the field declarations in the Character.UnicodeBlock class. For example, to determine whether 😁(0x1F601) is an emoticon:

boolean emoticon = Character.UnicodeBlock.EMOTICONS.equals(Character.UnicodeBlock.of("😁".codePointAt(0)));
System.out.println("Is 😁 an emoticon? " + emoticon); // Prints true.

Here's general-purpose code. It will process any String, presenting individual characters as their Unicode equivalents if they fall within the specified Unicode blocks:

package symbolstounicode;

import java.util.List;
import java.util.stream.Collectors;

public class SymbolsToUnicode {

    public static void main(String[] args) {

        Character.UnicodeBlock[] blocksToConvert = new Character.UnicodeBlock[]{
            Character.UnicodeBlock.EMOTICONS, 
            Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS};
        String input = "\"title\":\"👺TEST title value 😁\",\"text\":\"💖 TEST text value.\"";
        String output = SymbolsToUnicode.toUnicode(input, blocksToConvert);

        System.out.println("String to convert: " + input);
        System.out.println("Converted string: " + output);
        assert ("\"title\":\"U+1F47ATEST title value U+1F601\",\"text\":\"U+1F496 TEST text value.\"".equals(output));
    }

    // Converts characters in the supplied string found in the specified list of UnicodeBlocks to their Unicode equivalents.
    static String toUnicode(String s, final Character.UnicodeBlock[] blocks) {

        StringBuilder sb = new StringBuilder();
        List<Integer> cpList = s.codePoints().boxed().collect(Collectors.toList());

        // appendCodePoint works on Java 8+; Character.toString(int) exists only since Java 11.
        cpList.forEach(cp -> {
            if (SymbolsToUnicode.inCodeBlock(cp, blocks)) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    // Returns true if the supplied code point is within one of the specified UnicodeBlocks.
    static boolean inCodeBlock(final int cp, final Character.UnicodeBlock[] blocksToConvert) {

        for (Character.UnicodeBlock b : blocksToConvert) {
            if (b.equals(Character.UnicodeBlock.of(cp))) {
                return true;
            }
        }
        return false;
    }
}

And here's the output, using the test data in the OP:

run:
String to convert: "title":"👺TEST title value 😁","text":"💖 TEST text value."
Converted string: "title":"U+1F47ATEST title value U+1F601","text":"U+1F496 TEST text value."
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

  • I used font Segoe UI Symbol for the code and the output window to render the symbols properly.
  • The basic idea in the code is:
    • First, specify the String to be converted, and the Unicode blocks whose characters should be converted to Unicode.
    • Next, convert the String into a sequence of code points using String.codePoints(), and store them in a List.
    • Finally, for each code point, determine whether it falls within any of the specified Unicode blocks, and convert it if so.
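
The steps above could also be written without the intermediate List, by consuming the code point stream directly. This is only a sketch, assuming Java 8 or later; the class name BlockEscaper and the choice of a Set for the blocks are my own illustrative choices, not part of the code above.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class BlockEscaper {

    // Replaces each code point that belongs to one of the given blocks with "U+XXXXX".
    static String toUnicode(String s, Set<UnicodeBlock> blocks) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            // UnicodeBlock.of returns null for unassigned code points; contains(null) is simply false here.
            if (blocks.contains(UnicodeBlock.of(cp))) {
                sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<UnicodeBlock> blocks = new HashSet<>(Arrays.asList(
                UnicodeBlock.EMOTICONS,
                UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS));
        // \uD83D\uDC96 is U+1F496.
        System.out.println(toUnicode("\uD83D\uDC96 TEST text value.", blocks));
    }
}
```

A Set makes the membership test a single lookup, but for two or three blocks the array loop in inCodeBlock is just as reasonable.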
