
Escape Unicode Character 'POPCORN' to HTML Entity

I have a string with an emoji in it

I love 🍿

I need to replace that popcorn emoji with its HTML entity so that I get

I love &#x1F37F;

I am writing my code in Java and have tried several StringEscapeUtils libraries, but I haven't gotten any of them to work. Please help me figure out what I can use to escape special characters like Popcorn.

For reference:

Unicode Character Information

Unicode 8.0 (June 2015)

This is a little hacky, because I don't believe there is a ready-made library that does this. Assuming you can't simply serve your HTML page as UTF-8 (or UTF-16), which should render 🍿 as is, you can use Character.codePointAt(CharSequence, int) and Character.offsetByCodePoints(CharSequence, int, int) [1] to perform the conversion whenever a character is outside the ASCII range. Something like,

String str = "I love 🍿";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    if (ch > 127) {
        sb.append(String.format("&#x%x;", Character.codePointAt(str, i)));
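        // offsetByCodePoints returns an absolute index, so assign it (not +=);
        // the -1 compensates for the loop's own i++.
        i = Character.offsetByCodePoints(str, i, 1) - 1;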
    } else {
        sb.append(ch);
    }
}
System.out.println(sb);

which outputs (as requested)

I love &#x1f37f;

[1] Edited based on helpful comments from Andreas.
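To see why the loop above has to work with code points rather than single char values, here is a small sketch that inspects how Java stores the popcorn emoji. The class name SurrogateDemo and the helper describe are made up for illustration; only the Character and String methods are standard library calls.

```java
public class SurrogateDemo {
    // Hypothetical helper: reports how many UTF-16 chars and how many
    // code points a string holds, plus the code point at index 0.
    static String describe(String s) {
        int codePoint = Character.codePointAt(s, 0);
        return String.format("chars=%d codePoints=%d first=U+%X",
                s.length(), s.codePointCount(0, s.length()), codePoint);
    }

    public static void main(String[] args) {
        // The emoji is a surrogate pair: two chars, one code point.
        System.out.println(describe("\uD83C\uDF7F")); // chars=2 codePoints=1 first=U+1F37F
        // A plain ASCII letter is one char and one code point.
        System.out.println(describe("e"));            // chars=1 codePoints=1 first=U+65
    }
}
```

Because the emoji takes two chars, calling charAt alone would see two separate surrogates, which is why the loop reads the full code point and then skips past the second char.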

Normally the emoji4j library works. It has a simple htmlify method for HTML encoding.

For example:

String text = "I love 🍿";

EmojiUtils.htmlify(text); //returns "I love &#127871"

EmojiUtils.hexHtmlify(text); //returns "I love &#x1f37f"

You may use the unbescape library ("unbescape: powerful, fast and easy escape/unescape operations for Java").

Example

Add the dependency to the pom.xml file:

<dependency>
    <groupId>org.unbescape</groupId>
    <artifactId>unbescape</artifactId>
    <version>1.1.6.RELEASE</version>
</dependency>

The usage:

import org.unbescape.html.HtmlEscape;
import org.unbescape.html.HtmlEscapeLevel;
import org.unbescape.html.HtmlEscapeType;

<…>

final String inputString = "\uD83C\uDF7F";
final String escapedString = HtmlEscape.escapeHtml(
    inputString,
    HtmlEscapeType.HEXADECIMAL_REFERENCES,
    HtmlEscapeLevel.LEVEL_2_ALL_NON_ASCII_PLUS_MARKUP_SIGNIFICANT
);

// Here `escapedString` has the value: `&#x1f37f;`.

For your use case, HtmlEscapeType.HTML4_NAMED_REFERENCES_DEFAULT_TO_HEXA or HtmlEscapeType.HTML5_NAMED_REFERENCES_DEFAULT_TO_HEXA is probably a better choice than HtmlEscapeType.HEXADECIMAL_REFERENCES, since these emit a named reference where one exists and fall back to a hexadecimal reference otherwise.

I would use CharSequence::codePoints to get an IntStream of the code points and map them to strings, and then collect them, concatenating to a single string:

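import java.util.stream.Collectors;

public String escape(final String s) {
    return s.codePoints()
        // Non-ASCII code points become hexadecimal character references;
        // ASCII code points are kept as they are.
        .mapToObj(codePoint -> codePoint > 127 ?
            "&#x" + Integer.toHexString(codePoint) + ";" :
            new String(Character.toChars(codePoint)))
        .collect(Collectors.joining());
}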

For the specified input, this produces:

I love &#x1f37f;
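As a usage sketch, the same stream-based idea can be wrapped in a small self-contained class and tried on input where the emoji sits in the middle of the string and another non-ASCII character follows it. The class name HtmlEntityEscaper is an assumption made for this example, not something from the answer above.

```java
import java.util.stream.Collectors;

public class HtmlEntityEscaper {
    // Replaces every code point above the ASCII range with a hexadecimal
    // numeric character reference; ASCII code points pass through unchanged.
    static String escape(String s) {
        return s.codePoints()
                .mapToObj(cp -> cp > 127
                        ? "&#x" + Integer.toHexString(cp) + ";"
                        : String.valueOf((char) cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // Supplementary-plane emoji followed by a BMP non-ASCII letter.
        System.out.println(escape("Popcorn \uD83C\uDF7F and caf\u00E9"));
        // prints: Popcorn &#x1f37f; and caf&#xe9;
    }
}
```

Note that this sketch only handles non-ASCII characters. Markup-significant ASCII characters such as < and & are left alone, so text that may contain them still needs regular HTML escaping.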
