XML document to DOM object using DocumentBuilderFactory

I am currently modifying a piece of code, and I am wondering whether the way the XML is formatted (tabs and spacing) affects how it is parsed by a parser obtained from the DocumentBuilderFactory class.

In essence, the question is: can I pass one long string with no spacing to the parser, or does it need to be formatted in some way?

Thanks in advance. Included below is the class definition from Oracle's website.

Class DocumentBuilderFactory

"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents."

The documents will be different. Tabs and newlines between elements are converted into text nodes. You can eliminate these by calling setIgnoringElementContentWhitespace(true) on DocumentBuilderFactory.

But in order for that to work, you must set up your DOM parser to validate the content against a DTD or XML schema, because the parser needs the grammar to know which whitespace is ignorable.
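A minimal sketch of that approach, assuming the method in question is setIgnoringElementContentWhitespace and using a small inline DTD. The class name, the sample document, and the helper countRootChildren are made up for illustration; they are not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class IgnoreWhitespaceDemo {
    // Hypothetical sample: the inline DTD declares that <root> contains only
    // elements, so whitespace directly inside <root> is element-content whitespace.
    static final String XML =
        "<!DOCTYPE root [\n"
        + "<!ELEMENT root (A, B)>\n"
        + "<!ELEMENT A EMPTY>\n"
        + "<!ELEMENT B (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n\t<A/>\n\t<B>tagB</B>\n</root>";

    // Parses XML and returns the number of direct children of <root>.
    // When stripWhitespace is true, DTD validation is switched on as well.
    static int countRootChildren(boolean stripWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(stripWhitespace);
            factory.setIgnoringElementContentWhitespace(stripWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default parser:    " + countRootChildren(false));
        System.out.println("validating parser: " + countRootChildren(true));
    }
}
```

With the default settings the whitespace-only text nodes are counted along with A and B; with validation plus setIgnoringElementContentWhitespace(true) only the two elements should remain.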

Alternatively, you could remove the extra whitespace programmatically yourself, using something like the following:

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Recursively removes whitespace-only text nodes below the given node. */
public static void removeEmptyTextNodes(Node node) {
    NodeList nodeList = node.getChildNodes();
    // Walk backwards: the NodeList is live, so removing a child shifts later indexes.
    for (int x = nodeList.getLength() - 1; x >= 0; x--) {
        Node childNode = nodeList.item(x);
        if (childNode.getNodeType() == Node.TEXT_NODE) {
            if (childNode.getNodeValue().trim().isEmpty()) {
                node.removeChild(childNode);
            }
        } else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
            removeEmptyTextNodes(childNode);
        }
    }
}
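
To see the effect of this method, here is a small self-contained sketch that parses a formatted document, counts the children of the root, strips the whitespace-only text nodes, and counts again. The class name, the sample XML, and the parseRoot helper are assumptions added for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RemoveEmptyTextNodesDemo {
    // Same algorithm as the method above, repeated so the example compiles on its own.
    static void removeEmptyTextNodes(Node node) {
        NodeList nodeList = node.getChildNodes();
        for (int x = nodeList.getLength() - 1; x >= 0; x--) {
            Node childNode = nodeList.item(x);
            if (childNode.getNodeType() == Node.TEXT_NODE) {
                if (childNode.getNodeValue().trim().isEmpty()) {
                    node.removeChild(childNode);
                }
            } else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
                removeEmptyTextNodes(childNode);
            }
        }
    }

    // Hypothetical helper: parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<root>\n  <A></A>\n  <B>tagB</B>\n</root>");
        System.out.println("before: " + root.getChildNodes().getLength());
        removeEmptyTextNodes(root);
        System.out.println("after:  " + root.getChildNodes().getLength());
    }
}
```

Text nodes that contain real content, such as "tagB" inside B, are left alone; only nodes that are empty after trim() are removed.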

It should not affect the parser's ability to parse the string, as long as the string is well-formed XML. Tabs and newlines between elements are there for the benefit of the human reader. Note, however, that a DOM parser does not silently drop them by default; they show up as whitespace-only text nodes unless the parser is configured to ignore them.

Note that you will have to pass an input stream (a ByteArrayInputStream, for example; the older StringBufferInputStream is deprecated) to the DocumentBuilder, because the String overload of parse assumes its argument is a URI pointing to the XML.
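
As a sketch of one way to do this, the XML string can also be wrapped in an InputSource backed by a StringReader, which avoids choosing a byte encoding. The class and method names here are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseStringDemo {
    // Parses XML held in memory. builder.parse(xml) would instead treat
    // the string as a URI and try to open it.
    static Document parseXml(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml("<root><A></A><B>tagB</B></root>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```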

The DocumentBuilder builds different DOM objects for an XML string with line feeds and for an XML string without line feeds. Here is the code I tested:

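// Imports: java.io.ByteArrayInputStream, java.io.InputStream, java.nio.charset.StandardCharsets,
// javax.xml.parsers.DocumentBuilder, javax.xml.parsers.DocumentBuilderFactory,
// org.w3c.dom.Element, org.w3c.dom.NodeList
String newlineChar = "\n"; // set to "" to build the same XML without line feeds

StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true); // without this, getLocalName() returns null for elements too
DocumentBuilder builder = factory.newDocumentBuilder();

InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));
Element documentRoot = builder.parse(xmlInput).getDocumentElement();

NodeList nodes = documentRoot.getChildNodes();

System.out.println("How many children does the root have? => " + nodes.getLength());

for (int index = 0; index < nodes.getLength(); index++) {
    // Text nodes have no local name, so they print as "null"
    System.out.println(nodes.item(index).getLocalName());
}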

Output:
How many children does the root have? => 4
null
A
null
B

But if newlineChar is removed from the StringBuilder, the output is:
How many children does the root have? => 2
A
B

This demonstrates that the DOM objects generated by DocumentBuilder are different.
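
If the calling code should behave the same for both inputs, one option is to look only at element children and skip the text nodes. The following is a sketch under that assumption; the class ElementChildren and its helpers are hypothetical names, not part of the test above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ElementChildren {
    // Returns the tag names of the direct element children, ignoring text nodes.
    static List<String> childElementNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    // Hypothetical helper: parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element formatted = parseRoot("<root>\n<A></A>\n<B>tagB</B></root>");
        Element compact = parseRoot("<root><A></A><B>tagB</B></root>");
        System.out.println(childElementNames(formatted));
        System.out.println(childElementNames(compact));
    }
}
```

Both calls should list the same element names, since the whitespace-only text nodes are filtered out before the names are collected.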

There shouldn't be any effect from the formatting of the XML string, but I remember a strange problem from when I once passed a long string to an XML parser: the parser was unable to parse an XML file because it was written as one long line.

It may be safer to insert line breaks so that no line is longer than, let's say, 1000 bytes.

Sadly, I remember neither why that error occurred nor which parser I used.
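
Since the parser in that anecdote is unknown, a quick check against the parser that ships with the JDK may be useful. The sketch below builds a document as a single line with no line breaks and parses it; the class name, the element name item, and the element count are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class LongLineDemo {
    // Builds <root><item>0</item><item>1</item>...</root> on one line.
    static String buildSingleLineXml(int items) {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < items; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        return sb.append("</root>").toString();
    }

    // Parses the string and returns how many <item> elements were found.
    static int countItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("item").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = buildSingleLineXml(20000);
        System.out.println("characters on one line: " + xml.length());
        System.out.println("items parsed: " + countItems(xml));
    }
}
```

If a particular parser does choke on long lines, a check like this should make the failure reproducible.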
