简体   繁体   English

如何在 Java 中取消转义 HTML 字符实体?

[英]How to unescape HTML character entities in Java?

Basically I would like to decode a given Html document, and replace all special chars, such as " "基本上我想解码给定的 Html 文档,并替换所有特殊字符,例如" " -> " " , ">" -> " " , ">" -> ">" . -> ">"

In .NET we can make use of HttpUtility.HtmlDecode .在 .NET 中,我们可以使用HttpUtility.HtmlDecode

What's the equivalent function in Java? Java 中的等效函数是什么?

I have used the Apache Commons StringEscapeUtils.unescapeHtml4() for this:为此使用了 Apache Commons StringEscapeUtils.unescapeHtml4()

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes.将包含实体转义的字符串取消转义为包含与转义对应的实际 Unicode 字符的字符串。 Supports HTML 4.0 entities.支持 HTML 4.0 实体。

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world html in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.其他答案中提到的库将是很好的解决方案,但如果您已经碰巧在您的项目中挖掘了真实世界的 html,那么Jsoup项目提供的不仅仅是管理“& 磅 FFFF 分号”的东西。

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

And you also get the convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.并且您还可以获得用于提取和操作数据的便捷 API,使用最好的 DOM、CSS 和类似 jquery 的方法。 It's open source and MIT licence.它是开源和麻省理工学院许可证。

I tried Apache Commons StringEscapeUtils.unescapeHtml3() in my project, but wasn't satisfied with its performance.我在我的项目中尝试了 Apache Commons StringEscapeUtils.unescapeHtml3(),但对其性能并不满意。 Turns out, it does a lot of unnecessary operations.事实证明,它做了很多不必要的操作。 For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string.一方面,它为每次调用分配一个 StringWriter,即使字符串中没有任何可转义的内容。 I've rewritten that code differently, now it works much faster.我以不同的方式重写了该代码,现在它的运行速度要快得多。 Whoever finds this in google is welcome to use it.欢迎在谷歌找到这个的人使用它。

Following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3).以下代码对所有 HTML 3 符号和数字转义符进行转义(相当于 Apache unescapeHtml3)。 You can just add more entries to the map if you need HTML 4.如果您需要 HTML 4,您可以向地图添加更多条目。

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // found escape 
            if (input.charAt(i) == '#') {
                // numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null) 
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) { 
                    i++;
                    continue;
                }
            }
            else {
                // named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null) 
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"}, // non-breaking space
        {"\u00A1", "iexcl"}, // inverted exclamation mark
        {"\u00A2", "cent"}, // cent sign
        {"\u00A3", "pound"}, // pound sign
        {"\u00A4", "curren"}, // currency sign
        {"\u00A5", "yen"}, // yen sign = yuan sign
        {"\u00A6", "brvbar"}, // broken bar = broken vertical bar
        {"\u00A7", "sect"}, // section sign
        {"\u00A8", "uml"}, // diaeresis = spacing diaeresis
        {"\u00A9", "copy"}, // © - copyright sign
        {"\u00AA", "ordf"}, // feminine ordinal indicator
        {"\u00AB", "laquo"}, // left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"}, // not sign
        {"\u00AD", "shy"}, // soft hyphen = discretionary hyphen
        {"\u00AE", "reg"}, // ® - registered trademark sign
        {"\u00AF", "macr"}, // macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"}, // degree sign
        {"\u00B1", "plusmn"}, // plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"}, // superscript two = superscript digit two = squared
        {"\u00B3", "sup3"}, // superscript three = superscript digit three = cubed
        {"\u00B4", "acute"}, // acute accent = spacing acute
        {"\u00B5", "micro"}, // micro sign
        {"\u00B6", "para"}, // pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"}, // cedilla = spacing cedilla
        {"\u00B9", "sup1"}, // superscript one = superscript digit one
        {"\u00BA", "ordm"}, // masculine ordinal indicator
        {"\u00BB", "raquo"}, // right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // А - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Б - uppercase A, acute accent
        {"\u00C2", "Acirc"}, // В - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Г - uppercase A, tilde
        {"\u00C4", "Auml"}, // Д - uppercase A, umlaut
        {"\u00C5", "Aring"}, // Е - uppercase A, ring
        {"\u00C6", "AElig"}, // Ж - uppercase AE
        {"\u00C7", "Ccedil"}, // З - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // И - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // Й - uppercase E, acute accent
        {"\u00CA", "Ecirc"}, // К - uppercase E, circumflex accent
        {"\u00CB", "Euml"}, // Л - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // М - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Н - uppercase I, acute accent
        {"\u00CE", "Icirc"}, // О - uppercase I, circumflex accent
        {"\u00CF", "Iuml"}, // П - uppercase I, umlaut
        {"\u00D0", "ETH"}, // Р - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // С - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Т - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // У - uppercase O, acute accent
        {"\u00D4", "Ocirc"}, // Ф - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Х - uppercase O, tilde
        {"\u00D6", "Ouml"}, // Ц - uppercase O, umlaut
        {"\u00D7", "times"}, // multiplication sign
        {"\u00D8", "Oslash"}, // Ш - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Щ - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ъ - uppercase U, acute accent
        {"\u00DB", "Ucirc"}, // Ы - uppercase U, circumflex accent
        {"\u00DC", "Uuml"}, // Ь - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Э - uppercase Y, acute accent
        {"\u00DE", "THORN"}, // Ю - uppercase THORN, Icelandic
        {"\u00DF", "szlig"}, // Я - lowercase sharps, German
        {"\u00E0", "agrave"}, // а - lowercase a, grave accent
        {"\u00E1", "aacute"}, // б - lowercase a, acute accent
        {"\u00E2", "acirc"}, // в - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // г - lowercase a, tilde
        {"\u00E4", "auml"}, // д - lowercase a, umlaut
        {"\u00E5", "aring"}, // е - lowercase a, ring
        {"\u00E6", "aelig"}, // ж - lowercase ae
        {"\u00E7", "ccedil"}, // з - lowercase c, cedilla
        {"\u00E8", "egrave"}, // и - lowercase e, grave accent
        {"\u00E9", "eacute"}, // й - lowercase e, acute accent
        {"\u00EA", "ecirc"}, // к - lowercase e, circumflex accent
        {"\u00EB", "euml"}, // л - lowercase e, umlaut
        {"\u00EC", "igrave"}, // м - lowercase i, grave accent
        {"\u00ED", "iacute"}, // н - lowercase i, acute accent
        {"\u00EE", "icirc"}, // о - lowercase i, circumflex accent
        {"\u00EF", "iuml"}, // п - lowercase i, umlaut
        {"\u00F0", "eth"}, // р - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // с - lowercase n, tilde
        {"\u00F2", "ograve"}, // т - lowercase o, grave accent
        {"\u00F3", "oacute"}, // у - lowercase o, acute accent
        {"\u00F4", "ocirc"}, // ф - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // х - lowercase o, tilde
        {"\u00F6", "ouml"}, // ц - lowercase o, umlaut
        {"\u00F7", "divide"}, // division sign
        {"\u00F8", "oslash"}, // ш - lowercase o, slash
        {"\u00F9", "ugrave"}, // щ - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ъ - lowercase u, acute accent
        {"\u00FB", "ucirc"}, // ы - lowercase u, circumflex accent
        {"\u00FC", "uuml"}, // ь - lowercase u, umlaut
        {"\u00FD", "yacute"}, // э - lowercase y, acute accent
        {"\u00FE", "thorn"}, // ю - lowercase thorn, Icelandic
        {"\u00FF", "yuml"}, // я - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES) 
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

The following library can also be used for HTML escaping in Java: unbescape .以下库也可用于 Java 中的 HTML 转义: unbescape

HTML can be unescaped this way: HTML 可以通过这种方式进行转义:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText); 

This did the job for me,这对我有用,

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml(encodedXML);

or或者

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml4(encodedXML);

I guess its always better to use the lang3 for obvious reasons.我想出于显而易见的原因使用lang3总是更好。 Hope this helps :)希望这可以帮助 :)

Spring Framework HtmlUtils Spring 框架 HtmlUtils

If you're using Spring framework already, use the following method:如果您已经在使用 Spring 框架,请使用以下方法:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;

...

String result = htmlUnescape(source);

A very simple but inefficient solution without any external library is:没有任何外部库的一个非常简单但效率低下的解决方案是:

public static String unescapeHtml3( String str ) {
    try {
        HTMLDocument doc = new HTMLDocument();
        new HTMLEditorKit().read( new StringReader( "<html><body>" + str ), doc, 0 );
        return doc.getText( 1, doc.getLength() );
    } catch( Exception ex ) {
        return str;
    }
}

This should be use only if you have only small count of string to decode.仅当您只有少量要解码的字符串时才应使用此选项。

The most reliable way is with最可靠的方法是

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils .来自org.apache.commons.lang3.StringEscapeUtils

And to escape the whitespaces并逃避空白

cleanedString = cleanedString.trim();

This will ensure that whitespaces due to copy and paste in web forms to not get persisted in DB.这将确保由于在 Web 表单中复制和粘贴而导致的空白不会保留在 DB 中。

Consider using the HtmlManipulator Java class.考虑使用HtmlManipulator Java 类。 You may need to add some items (not all entities are in the list).您可能需要添加一些项目(并非所有实体都在列表中)。

The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; Kevin Hakanson 建议的 Apache Commons StringEscapeUtils 对我来说并没有 100% 工作; several entities like &#145 (left single quote) were translated into '222' somehow.像 &#145 (左单引号)这样的几个实体以某种方式被翻译成“222”。 I also tried org.jsoup, and had the same problem.我也试过 org.jsoup,也有同样的问题。

In my case i use the replace method by testing every entity in every variable, my code looks like this:就我而言,我通过测试每个变量中的每个实体来使用 replace 方法,我的代码如下所示:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

In my case this worked very well.就我而言,这非常有效。

StringEscapeUtils (Apache Commons Lang) StringEscapeUtils (Apache Commons Lang)
Escapes and unescapes Strings for Java, JavaScript, HTML, and XML.转义和取消转义 Java、JavaScript、HTML 和 XML 的字符串。

import org.apache.commons.lang.StringEscapeUtils;
....
StringEscapeUtils.unescapeHtml(comment);

Reference: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html参考: https : //commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

Incase you want to mimic what php function htmlspecialchars_decode does use php function get_html_translation_table() to dump the table and then use the java code like,柜面你想模仿什么 php 函数 htmlspecialchars_decode 使用 php 函数 get_html_translation_table() 转储表,然后使用 java 代码,如,

static Map<String,String> html_specialchars_table = new Hashtable<String,String>();
static {
        html_specialchars_table.put("&lt;","<");
        html_specialchars_table.put("&gt;",">");
        html_specialchars_table.put("&amp;","&");
}
static String htmlspecialchars_decode_ENT_NOQUOTES(String s){
        Enumeration en = html_specialchars_table.keys();
        while(en.hasMoreElements()){
                String key = en.nextElement();
                String val = html_specialchars_table.get(key);
                s = s.replaceAll(key, val);
        }
        return s;
}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM