
Convert HTML Character Back to Text Using Java Standard Library

I would like to convert some HTML character entities back to plain text using the Java standard library. Is there any library that would do this?

/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    // "Happy & Sad" in HTML form.
    String s = "Happy &amp; Sad";
    System.out.println(s);

    try {
        // Change to "Happy & Sad". DOESN'T WORK!
        s = java.net.URLDecoder.decode(s, "UTF-8");
        System.out.println(s);
    } catch (UnsupportedEncodingException ex) {

    }
}

I think the Apache Commons Lang library's StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html (the linked Javadoc is for Apache Commons Text, where this class is now maintained).

Just add the jsoup jar file to your application's lib directory and then use this code.

import org.jsoup.Jsoup;

public class Encoder {
    public static void main(String args[]) {
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}

Link to download jsoup: http://jsoup.org/download

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents a space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion.
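As a rough idea of what such a hand-written utility class might look like, here is a minimal sketch that uses only the standard library. The class name HtmlUnescaper and its unescape method are hypothetical names chosen for this illustration, and the table of named entities is deliberately a tiny subset (real HTML defines a couple of thousand), so treat it as a starting point rather than a complete implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK: unescapes numeric entities and a
// small, illustrative subset of named HTML entities.
final class HtmlUnescaper {
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches &name;, &#123; and &#x7B; style entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1));
            // Unknown or invalid entities are copied through unchanged.
            String replacement = (decoded != null) ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the entity is not recognised.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad")); // prints: Happy & Sad
    }
}
```

The input is scanned in a single pass, so text such as "&amp;lt;" becomes "&lt;" and is not unescaped a second time. Entities that are not in the table (for example &ccedil;) are left as they are, which is the main reason a maintained library is usually the better choice.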

The URL decoder should only be used for decoding strings from URLs generated by HTML forms, which use the "application/x-www-form-urlencoded" MIME type. It does not support HTML character entities.
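To make the difference concrete, the following small sketch runs java.net.URLDecoder on a form-encoded string and on an HTML-escaped string. The class name UrlDecoderDemo and the helper decodeForm are hypothetical names used only for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class showing what URLDecoder does and does not decode.
final class UrlDecoderDemo {
    static String decodeForm(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Form encoding: '+' is a space and %26 is '&'.
        System.out.println(decodeForm("Happy+%26+Sad"));   // prints: Happy & Sad
        // HTML entities are not form encoding, so the text is returned unchanged.
        System.out.println(decodeForm("Happy &amp; Sad")); // prints: Happy &amp; Sad
    }
}
```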

After a search, I found a Translate class within the HTML Parser library.

You can use the class org.apache.commons.lang.StringEscapeUtils:

String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");

It works.

I'm not aware of any way to do it using the standard library, but I do know and use this class that deals with HTML entities.

"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entities and vice versa."

http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

As @jem suggested, it is possible to use jsoup.

With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which retains the original HTML.

import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);

It seems that this method is not present in some earlier releases.

Or you can use unescapeHtml4:

    String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
    System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));

This code prints the line: GUÍA TELEFÓNICA
