How to unescape HTML entities **except** < > & " ' in Java

I have HTML input in UTF-8. In this input, accented characters are represented as HTML entities. For example:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body>
</html>

My goal is to "canonicalize" the HTML in Java by replacing HTML entities with UTF-8 characters wherever possible. In other words, replace all entities except &lt; &gt; &amp; &quot; &apos;

Goal:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>

I need this to make the HTML easier to compare in tests and easier to read by eye (a lot of escaped accented characters makes it hard to read).

I don't care about CDATA sections (there is no CDATA in the input).

I tried JSOUP ( https://jsoup.org/ ) and Apache's Commons Text ( https://commons.apache.org/proper/commons-text/ ), without success:

public void test() throws Exception {

    String html = 
            "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body></html>";

    // this is not good, keeps only the text content
    String s1 = Jsoup.parse(html).text();
    System.out.println("s1: " + s1);

    // this is better, but it unescapes the &lt; which is not what I want
    String s2 = StringEscapeUtils.unescapeHtml4(html);
    System.out.println("s2: " + s2);
}

StringEscapeUtils.unescapeHtml4() is almost what I need, but unfortunately it also unescapes &lt;, which yields:

<body>árvíztűrő<b</body>

What should I do?

Here is a minimal demo: https://github.com/riskop/html_utf8_canon.git

Looking at the Commons Text source code, it is clear that StringEscapeUtils.unescapeHtml4() delegates the work to an AggregateTranslator made up of four CharSequenceTranslators:

new AggregateTranslator(
        new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
        new NumericEntityUnescaper()
);
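To picture why the composition matters, here is a rough, dependency-free model of such a chain. It is only a sketch of the idea: the `Step` interface, the `lookup` and `run` helpers, and the two tiny tables are hypothetical names made up for illustration, not the Commons Text classes. At every position each step is tried in order; the first one that consumes characters wins, and if none matches the character is copied through unchanged.

```java
import java.util.List;
import java.util.Map;

final class TranslatorChainSketch {

    /** Hypothetical stand-in for a translator: returns the number of chars consumed at index, or 0. */
    interface Step {
        int translate(String input, int index, StringBuilder out);
    }

    /** Table-driven step: replaces a key found at the current index with its value. */
    static Step lookup(Map<String, String> table) {
        return (input, index, out) -> {
            for (Map.Entry<String, String> e : table.entrySet()) {
                if (input.startsWith(e.getKey(), index)) {
                    out.append(e.getValue());
                    return e.getKey().length();
                }
            }
            return 0; // nothing in this table matches here
        };
    }

    /** Tries each step in order at every position; copies the char when no step matches. */
    static String run(List<Step> steps, String input) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int consumed = 0;
            for (Step step : steps) {
                consumed = step.translate(input, pos, out);
                if (consumed > 0) {
                    break; // first matching step wins
                }
            }
            if (consumed == 0) {
                out.append(input.charAt(pos));
                pos++;
            } else {
                pos += consumed;
            }
        }
        return out.toString();
    }

    // Illustrative tables only; the real entity tables are far larger.
    static final Step BASIC = lookup(Map.of("&lt;", "<", "&gt;", ">", "&amp;", "&", "&quot;", "\""));
    static final Step ACCENTS = lookup(Map.of("&aacute;", "\u00E1", "&iacute;", "\u00ED"));
}
```

In this model, `run(List.of(BASIC, ACCENTS), input)` unescapes everything the tables know about, while `run(List.of(ACCENTS), input)` leaves `&lt;` alone simply because no step in the chain recognizes it.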

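Only the last three translators are needed for my goal; the BASIC_UNESCAPE lookup, which is the one that handles &lt; &gt; &amp; &quot;, can simply be left out.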

Like this:

    // this is what I needed!
    String s3 = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper()
    ).translate(html);
    System.out.println("s3: " + s3);

The whole method:

@Test
public void test() throws Exception {

    String html = 
            "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body></html>";

    // this is what I needed!
    CharSequenceTranslator UNESCAPE_HTML_EXCEPT_BASIC = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper()
    );

    String s3 = UNESCAPE_HTML_EXCEPT_BASIC.translate(html);
    System.out.println("s3: " + s3);

}

Result:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
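One caveat worth checking against your own input: as far as I can tell, NumericEntityUnescaper decodes every numeric reference, so something like &#60; or &#x26; would still be turned into a raw < or &. The sample input has no such references, so the approach above is fine for it. If yours might, here is a dependency-free sketch of the same idea that also leaves numerically encoded reserved characters alone. The class name `EntityNormalizer`, the method `normalize`, and the four-entry `NAMED` table are assumptions for illustration only; a real table would have to cover all the named entities you expect. It needs Java 9 or later.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityNormalizer {

    // Tiny illustrative table (assumption); lt, gt, amp, quot and apos are deliberately absent.
    private static final Map<String, String> NAMED = Map.of(
            "aacute", "\u00E1",
            "iacute", "\u00ED",
            "eacute", "\u00E9",
            "nbsp", "\u00A0");

    // Code points of < > & " ' which must stay escaped even when written numerically.
    private static final Set<Integer> RESERVED = Set.of(0x3C, 0x3E, 0x26, 0x22, 0x27);

    // Group 1: hex digits, group 2: decimal digits, group 3: entity name.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    static String normalize(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(decode(m)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Returns the decoded character(s), or the entity unchanged if it must stay escaped. */
    private static String decode(Matcher m) {
        String whole = m.group();
        if (m.group(3) != null) {
            // Unknown or reserved names are not in the table, so they pass through as-is.
            return NAMED.getOrDefault(m.group(3), whole);
        }
        try {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            if (RESERVED.contains(codePoint) || !Character.isValidCodePoint(codePoint)) {
                return whole;
            }
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            return whole; // number too large to be a code point
        }
    }
}
```

Calling `EntityNormalizer.normalize(html)` on the sample input should give the same body as the result above, with `&lt;` still escaped. The trade-off is that you have to maintain the entity table yourself, which Commons Text otherwise does for you.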
