How to unescape HTML entities in Java except < > & " '

Question

I have HTML input in UTF-8. In this input, accented characters are represented as HTML entities. For example:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body>
</html>

My goal is to "canonicalize" the HTML in Java by replacing HTML entities with UTF-8 characters wherever possible. In other words, replace all entities except &lt; &gt; &amp; &quot; &apos;.

Desired output:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>

I need this to make the HTML easier to compare in tests and easier to read by eye (a lot of escaped accented characters make it hard to read).

I don't care about CDATA sections (there is no CDATA in the input).

I tried JSOUP (https://jsoup.org/) and Apache Commons Text (https://commons.apache.org/proper/commons-text/), without success:

public void test() throws Exception {

    String html = 
            "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body></html>";

    // this is not good, keeps only the text content
    String s1 = Jsoup.parse(html).text();
    System.out.println("s1: " + s1);

    // this is better, but it unescapes the &lt; which is not what I want
    String s2 = StringEscapeUtils.unescapeHtml4(html);
    System.out.println("s2: " + s2);
}

StringEscapeUtils.unescapeHtml4() is almost what I need, but unfortunately it also unescapes &lt;, which yields:

<body>árvíztűrő<b</body>

What should I do?

Here is a minimal demo: https://github.com/riskop/html_utf8_canon.git

Answer 1

Looking at the Commons Text source code, it is clear that StringEscapeUtils.unescapeHtml4() delegates the work to an AggregateTranslator composed of four CharSequenceTranslators:

new AggregateTranslator(
        new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
        new NumericEntityUnescaper()
);

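I only need three of these translators to reach my goal: all of them except the BASIC_UNESCAPE lookup, which is the one that unescapes &lt; &gt; &amp; &quot;.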

Like this:

    // this is what I needed!
    String s3 = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper()
    ).translate(html);
    System.out.println("s3: " + s3);

The whole method:

@Test
public void test() throws Exception {

    String html = 
            "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body></html>";

    // this is what I needed!
    CharSequenceTranslator UNESCAPE_HTML_EXCEPT_BASIC = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper()
    );

    String s3 = UNESCAPE_HTML_EXCEPT_BASIC.translate(html);
    System.out.println("s3: " + s3);

}

Result:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
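
One caveat worth checking against your own input: as far as I can tell, NumericEntityUnescaper decodes every numeric reference, so something like &#60; or &#x26; would still be turned into a raw < or &. The following is a minimal, dependency-free sketch of the same idea that also leaves numeric references to the five reserved characters alone. It is not the Commons Text implementation; the class name EntityNormalizer, the method normalize, and the tiny NAMED table are hypothetical and only for illustration (a real table would have to cover all named entities you expect).

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityNormalizer {

    // Illustrative subset of named entities (assumption: extend as needed).
    // lt, gt, amp, quot and apos are deliberately absent so they stay escaped.
    private static final Map<String, String> NAMED = Map.of(
            "aacute", "\u00E1",
            "iacute", "\u00ED",
            "eacute", "\u00E9",
            "nbsp", "\u00A0");

    // Code points of < > & " ' which must never be unescaped.
    private static final Set<Integer> RESERVED = Set.of(0x3C, 0x3E, 0x26, 0x22, 0x27);

    // Matches &#x1F; (group 1), &#123; (group 2) and &name; (group 3).
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    public static String normalize(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = m.group(); // default: keep the reference as-is
            if (m.group(3) != null) {
                // Named reference: replaced only if it is in the table.
                replacement = NAMED.getOrDefault(m.group(3), replacement);
            } else {
                try {
                    int cp = m.group(1) != null
                            ? Integer.parseInt(m.group(1), 16)
                            : Integer.parseInt(m.group(2), 10);
                    if (Character.isValidCodePoint(cp) && !RESERVED.contains(cp)) {
                        replacement = new String(Character.toChars(cp));
                    }
                } catch (NumberFormatException e) {
                    // Number too large to be a code point: keep the original text.
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b &#60; &amp;</body>";
        System.out.println(normalize(html));
    }
}
```

This needs Java 9 or later (Map.of, Set.of, and Matcher.appendReplacement with a StringBuilder). If your input never contains numeric references to < > & " ', the three-translator AggregateTranslator above is simpler and covers the full HTML 4 entity tables.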
