
Parsing XML embedded in a CDATA section

I have an XML document with another XML document embedded inside a CDATA section. How can I parse the embedded XML?

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>
    <b>
        <c>
            <![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?><bigXML>]]>
        </c>
    </b>
</a>

I cannot use regex/replace on the value of c, because the embedded XML is about 250 MB in size; if I try any of those operations I get a Java heap out-of-memory error.

You may try to use Jsoup. Jsoup is actually an HTML parser, but it is also capable of parsing XML. It is quite intuitive, and once you are familiar with the selector syntax it is very easy to use. You can retrieve the content of your CDATA section as a CDataNode, parse its text as a second document, and use the built-in methods to get what you need.

Maven dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

I modified the very simplified XML given above to have an example to play around with:

import org.jsoup.Jsoup;
import org.jsoup.nodes.CDataNode;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class TestJavaClass {

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<axx>\n"
                + "    <bxx>\n"
                + "        <cxx>\n"
                + "            <![CDATA["
                + "                     <?xml version=\"1.0\"?>\n"
                + "                         <catalog>\n"
                + "                             <book id=\"bk101\">\n"
                + "                                 <author>Gambardella, Matthew</author>\n"
                + "                                 <title>XML Developer's Guide</title>\n"
                + "                                 <genre>Computer</genre>\n"
                + "                                 <price>44.95</price>\n"
                + "                                 <publish_date>2000-10-01</publish_date>\n"
                + "                                 <description>An in-depth look at creating applications \n"
                + "                                 with XML.</description>\n"
                + "                             </book>\n"
                + "                             <book id=\"bk102\">\n"
                + "                                 <author>Ralls, Kim</author>\n"
                + "                                 <title>Midnight Rain</title>\n"
                + "                                 <genre>Fantasy</genre>\n"
                + "                                 <price>5.95</price>\n"
                + "                                 <publish_date>2000-12-16</publish_date>\n"
                + "                                 <description>A former architect battles corporate zombies, \n"
                + "                                 an evil sorceress, and her own childhood to become queen \n"
                + "                                 of the world.</description>\n"
                + "                             </book>"
                + "                         </catalog>"
                + "                 ]]>\n"
                + "        </cxx>\n"
                + "    </bxx>\n"
                + "</axx>\n";

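        // Parse the outer document with Jsoup's XML parser instead of the default HTML parser.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // childNode(0) is the whitespace text before the CDATA section; childNode(1) is the CDATA node.
        CDataNode cdata = (CDataNode) doc.selectFirst("cxx").childNode(1);

        // Parse the CDATA text as a second XML document.
        Document cdataDoc = Jsoup.parse(cdata.text(), "", Parser.xmlParser());
        // Select every <author> element nested inside a <book>.
        Elements authors = cdataDoc.select("book author");
        authors.forEach(aut -> System.out.println(aut.text()));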
    }
}

Output:

Gambardella, Matthew
Ralls, Kim
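
The Jsoup example above builds both the outer and the embedded document fully in memory, which may still be too much for the 250 MB payload mentioned in the question. Below is a minimal sketch of a streaming alternative, assuming the JDK's built-in StAX API (javax.xml.stream) is acceptable. It copies the text of c to a temporary file chunk by chunk, then streams that file as a second document. The class name CdataStreamingSketch and the helper names spoolElementText and readAuthors are hypothetical, chosen only for illustration. Whether a given StAX implementation delivers a large CDATA section in several chunks or buffers it internally is implementation-dependent and was not measured here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CdataStreamingSketch {

    // Hypothetical helper: copies the character data inside <elementName> to a temp file.
    static Path spoolElementText(Reader outerXml, String elementName)
            throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Coalescing off: the parser is allowed to report long text in several events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        Path tmp = Files.createTempFile("embedded", ".xml");
        XMLStreamReader reader = factory.createXMLStreamReader(outerXml);
        try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            boolean inside = false;   // true while between <elementName> and its end tag
            boolean started = false;  // true once the first non-whitespace character was written
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    inside = true;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    inside = false;
                } else if (inside && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA
                        || event == XMLStreamConstants.SPACE)) {
                    char[] buf = reader.getTextCharacters();
                    int start = reader.getTextStart();
                    int len = reader.getTextLength();
                    if (!started) {
                        // Drop leading whitespace: an XML declaration must be the very
                        // first thing in the embedded document.
                        while (len > 0 && Character.isWhitespace(buf[start])) {
                            start++;
                            len--;
                        }
                        started = len > 0;
                    }
                    out.write(buf, start, len);
                }
            }
        } finally {
            reader.close();
        }
        return tmp;
    }

    // Hypothetical helper: streams the spooled document and collects the text of every <author>.
    static List<String> readAuthors(Path innerXml) throws XMLStreamException, IOException {
        List<String> authors = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newFactory();
        try (Reader in = Files.newBufferedReader(innerXml, StandardCharsets.UTF_8)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("author")) {
                    authors.add(reader.getElementText());
                }
            }
            reader.close();
        }
        return authors;
    }

    public static void main(String[] args) throws Exception {
        // Small stand-in for the real file; in practice pass a Reader over the 250 MB file.
        String outer = "<?xml version=\"1.0\"?>\n"
                + "<a><b><c>\n"
                + "  <![CDATA[<?xml version=\"1.0\"?><catalog>"
                + "<book><author>Gambardella, Matthew</author></book>"
                + "<book><author>Ralls, Kim</author></book>"
                + "</catalog>]]>\n"
                + "</c></b></a>";
        Path tmp = spoolElementText(new StringReader(outer), "c");
        try {
            readAuthors(tmp).forEach(System.out::println);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With the small sample in main, this prints the same two author names as the Jsoup version. The design choice is to trade disk I/O for heap: the embedded document is never held as a single String, and the second pass only keeps the values it collects.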
