
How to unescape HTML entities **except** < > & " ' in Java

I have HTML input in UTF-8. In this input, accented characters are represented as HTML entities. For example:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body>
</html>

My goal is to "canonicalize" the HTML in Java by replacing HTML entities with UTF-8 characters where possible. In other words, replace all entities except &lt; &gt; &amp; &quot; &apos;.

The goal:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>

I need this to make HTML documents easier to compare in tests and easier to read with the naked eye (lots of escaped accented characters make the text very hard to read).

I don't care about CDATA sections (there is no CDATA in the inputs).

I have tried jsoup ( https://jsoup.org/ ) and Apache Commons Text ( https://commons.apache.org/proper/commons-text/ ), without success:

public void test() throws Exception {

    String html = 
            "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body></html>";

    // this is not good, keeps only the text content
    String s1 = Jsoup.parse(html).text();
    System.out.println("s1: " + s1);

    // this is better, but it unescapes the &lt; which is not what I want
    String s2 = StringEscapeUtils.unescapeHtml4(html);
    System.out.println("s2: " + s2);
}

StringEscapeUtils.unescapeHtml4() is almost what I need, but unfortunately it also unescapes &lt; to <:

<body>árvíztűrő<b</body>

How should I do it?

Here is a minimal demonstration: https://github.com/riskop/html_utf8_canon.git

Looking into the Commons Text source, it is clear that StringEscapeUtils.unescapeHtml4() delegates the work to an AggregateTranslator, which is composed of four CharSequenceTranslators:

new AggregateTranslator(
        new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
        new NumericEntityUnescaper()
);
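To make the composition easier to reason about, here is a minimal stdlib-only sketch of the idea behind such an aggregate: at each position the translators are tried in order, the first one that consumes input wins, and if none matches the character is copied through unchanged. This is a simplified model, not the Commons Text implementation; the names TranslatorChainSketch, Step and lookup are hypothetical, and the lookup here does not do longest-match selection.

```java
import java.util.List;
import java.util.Map;

// Simplified model of an "aggregate of translators". All names here are
// hypothetical and only illustrate the control flow.
final class TranslatorChainSketch {

    /** One translator: writes a replacement and returns how many chars it consumed at index, or 0. */
    interface Step {
        int translate(String input, int index, StringBuilder out);
    }

    /** A table-driven step: replaces any key of the table that starts at the current index. */
    static Step lookup(Map<String, String> table) {
        return (input, index, out) -> {
            for (Map.Entry<String, String> entry : table.entrySet()) {
                if (input.startsWith(entry.getKey(), index)) {
                    out.append(entry.getValue());
                    return entry.getKey().length();
                }
            }
            return 0; // nothing matched at this position
        };
    }

    /** Walks the input; at each position the first step that consumes something wins. */
    static String translate(String input, List<Step> steps) {
        StringBuilder out = new StringBuilder(input.length());
        int pos = 0;
        while (pos < input.length()) {
            int consumed = 0;
            for (Step step : steps) {
                consumed = step.translate(input, pos, out);
                if (consumed != 0) {
                    break;
                }
            }
            if (consumed == 0) {
                // No step handled this position: copy the character as-is.
                out.append(input.charAt(pos));
                pos++;
            } else {
                pos += consumed;
            }
        }
        return out.toString();
    }
}
```

In this model, leaving a table out of the list simply means its entities are never matched, so they pass through untouched.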

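I need only three of these translators to fulfill my goal: all except the first one, LookupTranslator(EntityArrays.BASIC_UNESCAPE), which is the one that unescapes &lt; &gt; &amp; and &quot;.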

So this is it:

    // this is what I needed!
    String s3 = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper()
    ).translate(html);
    System.out.println("s3: " + s3);

Whole method:

@Test
public void test() throws Exception {

    String html = 
            "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>&aacute;rv&iacute;zt&#x0171;r&#x0151;&lt;b</body></html>";

    // this is what I needed!
    CharSequenceTranslator UNESCAPE_HTML_EXCEPT_BASIC = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper()
    );

    String s3 = UNESCAPE_HTML_EXCEPT_BASIC.translate(html);
    System.out.println("s3: " + s3);

}

Result:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
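One case worth checking if the inputs may contain it: NumericEntityUnescaper decodes every numeric reference, so an input such as &#60; or &#x26; would still come out as a raw < or &. The sample input above has no such references, so this does not affect the result shown. If it matters, a different technique is a small regex-based pass that decodes references but leaves anything mapping to < > & " ' alone. The sketch below uses only the JDK; the class name, method name and the tiny NAMED table are hypothetical, and a real table would need all the named entities you expect.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex-based alternative (not the Commons Text approach): decodes character
// references except those that stand for the reserved characters < > & " '.
final class SelectiveUnescapeSketch {

    // group 1: hex digits, group 2: decimal digits, group 3: entity name
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    private static final String RESERVED = "<>&\"'";

    // Illustrative subset only (assumption): extend with the names you need.
    // lt, gt, amp, quot and apos are deliberately absent so they stay escaped.
    private static final Map<String, String> NAMED = Map.of(
            "aacute", "\u00E1",
            "eacute", "\u00E9",
            "iacute", "\u00ED");

    static String unescapeExceptReserved(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder(html.length());
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // text before the reference
            out.append(decode(m));
            last = m.end();
        }
        out.append(html, last, html.length()); // trailing text
        return out.toString();
    }

    private static String decode(Matcher m) {
        String original = m.group();
        if (m.group(3) != null) {
            // Named entity: decode only if it is in the table, else keep as-is.
            return NAMED.getOrDefault(m.group(3), original);
        }
        int codePoint;
        try {
            codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
        } catch (NumberFormatException e) {
            return original; // too many digits to be a code point
        }
        if (!Character.isValidCodePoint(codePoint) || RESERVED.indexOf(codePoint) >= 0) {
            return original; // invalid, or one of < > & " ' : leave escaped
        }
        return new String(Character.toChars(codePoint));
    }
}
```

Calling SelectiveUnescapeSketch.unescapeExceptReserved(html) on the sample input decodes the accented characters and keeps &lt; as it is; unknown names and reserved numeric references pass through unchanged.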
