Problems parsing XML in Java

I am having trouble parsing an XML document. For some reason there are text nodes where I would not expect them, and therefore my test turns red. The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<RootNode>
  <PR1>PR1</PR1>
  <ROL>one</ROL>
  <ROL>two</ROL>
  <DG1>DG1</DG1>
  <ROL>three</ROL>
  <ZBK>ZBK</ZBK>
  <ROL>four</ROL>
</RootNode>

This snippet of code reproduces the error:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(TestHL7Helper.class.getResourceAsStream("TestHL7HelperInput.xml"));
Node root = doc.getFirstChild();
Node pr1 = root.getFirstChild();

Inspecting the root variable yields [RootNode: null], which seems to be right, but then it goes wrong. The pr1 variable turns out to be a text node [#text:\n ]. Why does the parser think that the newline and the spaces are a text node? Shouldn't they be ignored? I tried changing the encoding, but that did not help either. Any ideas?

If I remove all newlines and spaces and put my XML document on a single line, it all works fine.

Any text between other nodes, including whitespace, forms a text node of its own. So if you use getFirstChild(), you will also retrieve those text nodes.
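
To make this visible, here is a minimal sketch that lists the node names of the root element's direct children. The class name, the helper method and the shortened inline XML are assumptions for illustration only; they stand in for the file from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WhitespaceNodesDemo {

    // Shortened, hypothetical stand-in for the XML file in the question.
    static final String XML =
        "<RootNode>\n  <PR1>PR1</PR1>\n  <ROL>one</ROL>\n</RootNode>";

    // Returns the node names of the root element's direct children, in document order.
    static List<String> childNodeNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                names.add(children.item(i).getNodeName());
            }
            return names;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        // Text nodes report the name "#text"; they sit between the elements.
        System.out.println(childNodeNames(XML)); // [#text, PR1, #text, ROL, #text]
    }
}
```

The whitespace used for indentation shows up as a "#text" entry before, between and after the elements, which is why getFirstChild() on the root does not return PR1.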

In your case elements such as PR1 have a unique name, so you can retrieve them individually with getElementsByTagName():

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(TestHL7Helper.class.getResourceAsStream("TestHL7HelperInput.xml"));
Element root = doc.getDocumentElement();
// getElementsByTagName() returns a NodeList, so use item(0) instead of array indexing
Node pr1 = root.getElementsByTagName("PR1").item(0);

In general I would not rely on the position within the XML document, but on things like tag names, attributes or ids.
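
As a rough sketch of name-based lookup, the following collects the text of every element with a given tag name. The class name, the helper method and the inline XML are hypothetical. Note that getElementsByTagName() searches all descendants, not only direct children, and returns every match, so a repeated tag such as ROL yields several entries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagLookupDemo {

    // Shortened, hypothetical stand-in for the XML file in the question.
    static final String XML =
        "<RootNode>\n  <PR1>PR1</PR1>\n  <ROL>one</ROL>\n  <ROL>two</ROL>\n</RootNode>";

    // Returns the text content of every descendant element named tagName.
    static List<String> textsByTag(String xml, String tagName) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            NodeList matches = root.getElementsByTagName(tagName);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                texts.add(matches.item(i).getTextContent());
            }
            return texts;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(textsByTag(XML, "PR1")); // [PR1]
        System.out.println(textsByTag(XML, "ROL")); // [one, two]
    }
}
```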

XML supports mixed content, meaning that elements can have both text and element child nodes. This supports use cases like the following:

<text>I've bolded the <b>important</b> part.</text>

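Parsing the mixed-content example with a plain DOM parser shows why text nodes cannot simply be dropped. This is a minimal sketch; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MixedContentDemo {

    static final String XML = "<text>I've bolded the <b>important</b> part.</text>";

    // Describes each direct child of the root element as "nodeName=textContent".
    static List<String> describeChildren(String xml) {
        try {
            NodeList children = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getChildNodes();
            List<String> described = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                described.add(child.getNodeName() + "=" + child.getTextContent());
            }
            return described;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        // One entry per line: a text node, the <b> element, another text node.
        for (String line : describeChildren(XML)) {
            System.out.println(line);
        }
    }
}
```

Here the text nodes on either side of the b element carry real content, and without a schema or DTD the parser has no way to tell them apart from indentation whitespace.
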
input.xml

This means that by default a DOM parser will treat the whitespace nodes in the following document as significant (below is a simplified version of your XML document):

<RootNode>
  <PR1>PR1</PR1>
</RootNode>

Demo Code

If you have an XML schema, you can set the ignoringElementContentWhitespace property on the DocumentBuilderFactory. With the schema, the DOM parser knows where whitespace is significant and where it is not.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.*;
import javax.xml.validation.*;

import org.w3c.dom.Document;

public class Demo {

    public static void main(String[] args) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema s = sf.newSchema(new File("src/forum16231687/schema.xsd"));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setSchema(s);
        dbf.setIgnoringElementContentWhitespace(true);

        DocumentBuilder db = dbf.newDocumentBuilder();
        Document d = db.parse(new File("src/forum16231687/input.xml"));
        System.out.println(d.getDocumentElement().getChildNodes().getLength());
    }

}

schema.xsd

If you create a schema.xsd that looks like the following, the demo code will report that the root element has 1 child node.

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="RootNode">
        <complexType>
            <sequence>
                <element name="PR1" type="string"/>
            </sequence>
        </complexType>
    </element>
</schema>

If you change schema.xsd so that RootNode has mixed content, the demo code will report that RootNode has 3 child nodes.

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="RootNode">
        <complexType mixed="true">
            <sequence>
                <element name="PR1" type="string"/>
            </sequence>
        </complexType>
    </element>
</schema>

You can solve this general issue by checking the type of the node:

if (someNode instanceof Element) {
  // ...
}

This can easily form part of a loop, such as:

NodeList childNodes = root.getChildNodes();
for (int i = 0; i < childNodes.getLength(); i++) {
  if (childNodes.item(i).getNodeType() == Node.ELEMENT_NODE) {
    Element childElement = (Element) childNodes.item(i);
    // ...
  }
}
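
The loop above can be wrapped in a small reusable helper. The following is only a sketch under assumptions: the class name, both method names and the inline XML are hypothetical and not part of any library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildElementsDemo {

    // Returns only the direct children of parent that are elements,
    // skipping text nodes, comments and other node types.
    static List<Element> childElements(Node parent) {
        List<Element> elements = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                elements.add((Element) child);
            }
        }
        return elements;
    }

    // Demo wrapper: parses xml and lists the tag names of the root's child elements.
    static List<String> childElementNames(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            List<String> names = new ArrayList<>();
            for (Element element : childElements(root)) {
                names.add(element.getTagName());
            }
            return names;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<RootNode>\n  <PR1>PR1</PR1>\n  <ROL>one</ROL>\n  <ROL>two</ROL>\n</RootNode>";
        System.out.println(childElementNames(xml)); // [PR1, ROL, ROL]
    }
}
```

With such a helper, childElements(root).get(0) gives the first child element (PR1 in the question's document) regardless of how the file is indented.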

Alternatively, use something like XMLBeans to reduce the likelihood of introducing bugs when parsing XML manually. Get a well-tested library to do the work for you!
