
XML document to DOM object using DocumentBuilderFactory

I am currently modifying a piece of code and I am wondering if the way the XML is formatted (tabs and spacing) will affect the way in which it is parsed into the DocumentBuilderFactory class.

In essence the question is: can I pass one long string with no spacing into the DocumentBuilderFactory, or does it need to be formatted in some way?

Thanks in advance. Included below is the class definition from Oracle's website.

Class DocumentBuilderFactory

"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "

The documents will be different. Tabs and new lines will be converted into text nodes. You can eliminate these by calling setIgnoringElementContentWhitespace(true) on DocumentBuilderFactory.

But in order for it to work you must set up your DOM parser to validate the content against a DTD or XML schema.
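As a rough sketch of the validating setup described above, the following uses an internal DTD so the parser can tell which whitespace is ignorable. The class name, the sample document, and the countRootChildren helper are made up for illustration; only the DocumentBuilderFactory calls come from the JAXP API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IgnoreWhitespaceSketch {
    // Hypothetical sample document with an internal DTD declaring element-only content for <root>.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (A, B)>\n"
        + "  <!ELEMENT A EMPTY>\n"
        + "  <!ELEMENT B (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n  <A/>\n  <B>tagB</B>\n</root>";

    // Parses XML with validation on and returns the number of child nodes of <root>.
    static int countRootChildren(boolean ignoreWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // the DTD is needed to know which whitespace is ignorable
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("keeping whitespace:  " + countRootChildren(false));
        System.out.println("ignoring whitespace: " + countRootChildren(true));
    }
}
```

With the flag on, only the A and B elements should remain as children of the root; with it off, the indentation between them is kept as text nodes.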

Alternatively you could programmatically remove the extra whitespace yourself using something like the following:

public static void removeEmptyTextNodes(Node node) {
    NodeList nodeList = node.getChildNodes();
    Node childNode;
    for (int x = nodeList.getLength() - 1; x >= 0; x--) {
        childNode = nodeList.item(x);
        if (childNode.getNodeType() == Node.TEXT_NODE) {
            if (childNode.getNodeValue().trim().equals("")) {
                node.removeChild(childNode);
            }
        } else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
            removeEmptyTextNodes(childNode);
        }
    }
}
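For illustration, here is one way the helper above might be used on a parsed string. The class name, the parse helper, and the sample XML are assumptions added for this sketch; the helper itself is the same logic as above, repeated so the example compiles on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RemoveEmptyTextNodesDemo {
    // Same approach as the helper above: drop whitespace-only text nodes, recursing into elements.
    static void removeEmptyTextNodes(Node node) {
        NodeList children = node.getChildNodes();
        // Iterate backwards so removals do not shift the indexes still to be visited.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) {
                    node.removeChild(child);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeEmptyTextNodes(child);
            }
        }
    }

    // Hypothetical convenience method: parse XML held in a String (non-validating, default settings).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root>\n\t<A></A>\n\t<B>tagB</B>\n</root>");
        Node root = doc.getDocumentElement();
        System.out.println("children before: " + root.getChildNodes().getLength());
        removeEmptyTextNodes(root);
        System.out.println("children after:  " + root.getChildNodes().getLength());
    }
}
```

Text nodes that carry real content, such as "tagB" inside B, are left alone because they are not empty after trimming.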

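It should not affect the ability of the parser as long as the string is valid XML. Tabs and newlines between elements are really for the aesthetics of the human reader, although a DOM parser does not necessarily strip them: by default they usually end up in the tree as whitespace-only text nodes.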

Note you will have to pass an input stream (a ByteArrayInputStream, for example; StringBufferInputStream is deprecated) or an InputSource to the DocumentBuilder, as the String version of parse assumes the string is a URI pointing to the XML.
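A minimal sketch of parsing XML that is held in a String, assuming an InputSource wrapped around a StringReader is acceptable in your code. The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseXmlString {
    // Hypothetical helper: turns XML text into a DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // builder.parse(xml) would treat the String as a URI, so wrap the text in an InputSource.
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root><A></A><B>tagB</B></root>");
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

Using a Reader also sidesteps the question of which byte encoding to pick when converting the String to a stream.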

The DocumentBuilder builds different DOM objects for an XML string with line feeds and an XML string without line feeds. Here is the code I tested:

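import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

String newlineChar = "\n"; // set to "" to test the string without line feeds
StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");

// Assumption: namespace awareness is on; otherwise getLocalName() typically returns null for elements as well
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();

InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes());
Element documentRoot = builder.parse(xmlInput).getDocumentElement();

NodeList nodes = documentRoot.getChildNodes();

System.out.println("How many children does the root have? => " + nodes.getLength());

for (int index = 0; index < nodes.getLength(); index++) {
    System.out.println(nodes.item(index).getLocalName());
}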

Output:
How many children does the root have? => 4
null
A
null
B

But if newlineChar is removed from the StringBuilder, the output is:
How many children does the root have? => 2
A
B

This demonstrates that the DOM objects generated by DocumentBuilder are different.

The format of the XML string shouldn't have any effect, but I do remember a strange problem when I passed a long String to an XML parser. The parser was unable to parse an XML file that was written all in one long line.

It may be safer to insert line breaks so that no line is longer than, let's say, 1000 bytes.

Sadly, I remember neither why that error occurred nor which parser I was using.
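Since the parser behind that problem is unknown, one way to check your own setup is to build a long single-line document and parse it. The sketch below does this with the JDK's default DocumentBuilder; the class name, the item element, and the item count are arbitrary choices for the test, not part of any API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LongLineXml {
    // Builds <root><item>0</item><item>1</item>...</root> on a single line and parses it.
    // Returns the number of children found under <root>.
    static int parseAndCount(int items) throws Exception {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < items; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        sb.append("</root>");
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(sb.toString())));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        int items = 5000; // arbitrary; produces a single line far longer than 1000 bytes
        System.out.println("parsed children: " + parseAndCount(items));
    }
}
```

If the count matches the number of items, the parser in use has no trouble with long lines at that size; a different parser or a much larger input may still behave differently.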
