
Parsing XML embedded in a CDATA section

I have an XML document with another XML document embedded inside a CDATA section. Any idea how I can parse the embedded XML?

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>
    <b>
        <c>
            <![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?><bigXML>]]>
        </c>
    </b>
</a>

I cannot use regex or replace operations on the value of c, because the embedded XML is about 250 MB in size, and when I try any of those operations I get a Java heap out-of-memory error.

You could try Jsoup. Jsoup is primarily an HTML parser, but it can also parse XML. It is quite intuitive, and once you are familiar with the selector syntax it is easy to use. You can retrieve the content of your CDATA section as a CDataNode and use the built-in methods to get what you need.

Maven dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

I modified the simplified XML from your question to have an example to play around with:

import org.jsoup.Jsoup;
import org.jsoup.nodes.CDataNode;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class TestJavaClass {

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<axx>\n"
                + "    <bxx>\n"
                + "        <cxx>\n"
                + "            <![CDATA["
                + "                     <?xml version=\"1.0\"?>\n"
                + "                         <catalog>\n"
                + "                             <book id=\"bk101\">\n"
                + "                                 <author>Gambardella, Matthew</author>\n"
                + "                                 <title>XML Developer's Guide</title>\n"
                + "                                 <genre>Computer</genre>\n"
                + "                                 <price>44.95</price>\n"
                + "                                 <publish_date>2000-10-01</publish_date>\n"
                + "                                 <description>An in-depth look at creating applications \n"
                + "                                 with XML.</description>\n"
                + "                             </book>\n"
                + "                             <book id=\"bk102\">\n"
                + "                                 <author>Ralls, Kim</author>\n"
                + "                                 <title>Midnight Rain</title>\n"
                + "                                 <genre>Fantasy</genre>\n"
                + "                                 <price>5.95</price>\n"
                + "                                 <publish_date>2000-12-16</publish_date>\n"
                + "                                 <description>A former architect battles corporate zombies, \n"
                + "                                 an evil sorceress, and her own childhood to become queen \n"
                + "                                 of the world.</description>\n"
                + "                             </book>"
                + "                         </catalog>"
                + "                 ]]>\n"
                + "        </cxx>\n"
                + "    </bxx>\n"
                + "</axx>\n";

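        // Parse the outer document with the XML parser instead of the default HTML parser.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // childNode(0) of <cxx> is the whitespace text before the CDATA section,
        // so in this example the CDATA node sits at index 1.
        CDataNode cdata = (CDataNode) doc.selectFirst("cxx").childNode(1);

        // Parse the CDATA text as a second XML document and query it with selectors.
        Document cdataDoc = Jsoup.parse(cdata.text(), "", Parser.xmlParser());
        Elements authors = cdataDoc.select("book author");
        authors.forEach(aut -> {
            System.out.println(aut.text());
        });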
    }
}

Output:

Gambardella, Matthew
Ralls, Kim
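
Note that Jsoup.parse builds the whole document tree in memory, so with the 250 MB payload from the question it may hit the same heap limit. One possible alternative is a streaming approach with the JDK's built-in StAX API. The following is only a minimal sketch under several assumptions: the class name CdataStreamingSketch, the helper methods copyElementText and readAuthors, and the idea of spooling the embedded XML to a temporary file are all made up for illustration, and it has not been measured against a 250 MB input. Whether the parser hands out a large CDATA section in small chunks depends on the StAX implementation in use.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CdataStreamingSketch {

    // Hypothetical helper: copies the text of the first <elementName> element to 'out',
    // skipping leading whitespace so an embedded XML declaration ends up at position 0.
    public static void copyElementText(Reader outerXml, String elementName, Writer out) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Non-coalescing mode allows the parser to report text in several events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, false);
        XMLStreamReader reader = factory.createXMLStreamReader(outerXml);
        boolean inside = false;
        boolean started = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT && elementName.equals(reader.getLocalName())) {
                inside = true;
            } else if (event == XMLStreamConstants.END_ELEMENT && elementName.equals(reader.getLocalName())) {
                break; // only the first matching element is handled in this sketch
            } else if (inside && (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA
                    || event == XMLStreamConstants.SPACE)) {
                char[] buf = reader.getTextCharacters();
                int start = reader.getTextStart();
                int len = reader.getTextLength();
                if (!started) {
                    // Drop whitespace that precedes the embedded document.
                    while (len > 0 && Character.isWhitespace(buf[start])) {
                        start++;
                        len--;
                    }
                    started = len > 0;
                }
                out.write(buf, start, len);
            }
        }
        reader.close();
    }

    // Hypothetical helper: streams over the extracted document and collects <author> texts.
    public static List<String> readAuthors(Path innerXml) throws Exception {
        List<String> authors = new ArrayList<>();
        try (Reader in = Files.newBufferedReader(innerXml, StandardCharsets.UTF_8)) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "author".equals(reader.getLocalName())) {
                    authors.add(reader.getElementText());
                }
            }
            reader.close();
        }
        return authors;
    }

    public static void main(String[] args) throws Exception {
        // Small stand-in for the real input; a large file would be opened with a file Reader instead.
        String outer = "<?xml version=\"1.0\"?>\n"
                + "<a><b><c>\n"
                + "    <![CDATA[<?xml version=\"1.0\"?><catalog>"
                + "<book><author>Gambardella, Matthew</author></book>"
                + "<book><author>Ralls, Kim</author></book>"
                + "</catalog>]]>\n"
                + "</c></b></a>";
        Path tmp = Files.createTempFile("inner", ".xml");
        try {
            try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                copyElementText(new StringReader(outer), "c", out);
            }
            System.out.println(readAuthors(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The temporary file keeps the two parsing passes independent and avoids holding the embedded document in a single String. A piped reader and writer on two threads would avoid the disk round trip, at the cost of more moving parts.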

