Automatic way to parse tags inside CDATA with Jsoup, without replacing

Question

I'm reading an RSS feed whose description tag holds HTML code rather than plain text. I need to extract some information from it, such as the first image that appears, but I can't: none of the tags inside description are parsed by Jsoup, which I assume is due to the behaviour of the CDATA element.

I say "automatic way" because in another question published here I saw the suggestion to use .replace() to remove the CDATA markers. That does not seem like an effective solution to me, since I think it would only work for specific cases and not in general. So my question is: is there a way to make Jsoup parse this content without me replacing text myself? Is replacing the only way that exists? Should I use another library?

For example, when I parse the RSS document, the description node contains this:

&lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;&lt;tr&gt;&lt;td align='left' width='10'&gt;&lt;
a href='http://www.3djuegos.com/noticia/145062/0/bioware-nuevo-juego-ip/video-gamescom/trailer/'&gt;&lt;img src='http://i11c.3djuegos.com/juegos/7332/dragon_age_iii/fotos/noticias/dragon_age_iii-2583054.jpg' border='0' width='70' height='52' /&gt;
&lt;/a&gt;&lt;/td&gt;&lt;td align='left' valign='top'&gt;Parece ser una nueva licencia creativa, seg&uacute;n lo visto en un enigm&aacu

All the special characters "<>" are escaped, because that is how CDATA works. The rest of the document is parsed fine; this only happens with CDATA content.

The code I use to access it:

doc = Jsoup.connect("http://www.3djuegos.com/universo/rss/rss.php?plats=1-2-3-4-5-6-7-34&tipos=noticia-analisis-avance-video-imagenes-demo&fotos=peques&limit=20").get();
System.out.println(doc.html()); // Shows the document well parsed.

Elements nodes = doc.getElementsByTag("item"); // All news items
for (int i = 0; i < nodes.size(); i++) { // Loop over the news items

    // Description node
    Element descriptionNode = nodes.get(i).getElementsByTag("description").get(0);

    // Prints the content of the description tag; this is where the HTML tags come out escaped
    System.out.println(nodes.get(i).getElementsByTag("description").html());

    // Try to access the first image. This fails because the description text is escaped,
    // so Jsoup does not parse it into nodes
    Element imageNode = descriptionNode.getElementsByTag("img").get(0);
}

Edit: I use doc.outputSettings().escapeMode(EscapeMode.xhtml), but I suppose it doesn't affect CDATA content.

Edit2: As a workaround I use org.apache.commons.lang3.StringEscapeUtils, which can unescape HTML, but I'm still wondering whether Jsoup already has something for this scenario.

Answer 1

You could use the text() method to get the unescaped value. That means if an element has a value like &lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt; then element.text() returns <table width='100%' border='0' cellspacing='0' cellpadding='4'>. You can then parse that fragment again and select whatever you want from it. For example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Sample {
    public static void main(String[] args) throws Exception {
        String html = "<description>"
                        + "&lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;&lt;tr&gt;&lt;td align='left' width='10'&gt;&lt;"
                        + "a href='http://www.3djuegos.com/noticia/145062/0/bioware-nuevo-juego-ip/video-gamescom/trailer/'&gt;&lt;img src='http://i11c.3djuegos.com/juegos/7332/dragon_age_iii/fotos/noticias/dragon_age_iii-2583054.jpg' border='0' width='70' height='52' /&gt;"
                        + "&lt;/a&gt;&lt;/td&gt;&lt;td align='left' valign='top'&gt;Parece ser una nueva licencia creativa, seg&uacute;n lo visto en un enigm&aacu"
                    + "</description>";

        Document doc = Jsoup.parse(html);
        for(Element desc : doc.select("description")){
            String unescapedHtml = desc.text();
            String src = Jsoup.parse(unescapedHtml).select("img").first().attr("src");
            System.out.println(src);
        }
        System.out.println("Done");
    }

}
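
On the "should I use another library?" part of the question, one point of comparison is the JDK's built-in DOM parser, which treats a CDATA section as ordinary character data. The sketch below is only an illustration under stated assumptions: the feed string, the example.com URLs, and the class and method names (CdataDescription, SAMPLE_RSS, firstDescription) are made up, and it is not the approach from the answer above. It shows that getTextContent() hands back the markup inside CDATA unescaped, so the resulting string could then be passed to Jsoup.parse() as in the answer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CdataDescription {

    // Hypothetical, shortened feed; the real one would come from the RSS URL in the question.
    static final String SAMPLE_RSS =
          "<rss version='2.0'><channel><item>"
        + "<title>Example item</title>"
        + "<description><![CDATA[<table><tr><td>"
        + "<a href='http://example.com/news'><img src='http://example.com/thumb.jpg' /></a>"
        + "</td></tr></table>]]></description>"
        + "</item></channel></rss>";

    // Returns the character data of the first <description> element, or null if there is none.
    static String firstDescription(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("description");
        // getTextContent() concatenates text and CDATA children without escaping them
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Prints the raw HTML fragment that was wrapped in CDATA
        System.out.println(firstDescription(SAMPLE_RSS));
    }
}
```

The string returned by firstDescription is plain HTML, so no .replace() call or separate unescaping step is needed before parsing it a second time.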

Answer 1: 4, accepted, 2014-07-25 13:36:01