Reading big XML files using StAX and DOM

Question

I need to read several big (200 MB-500 MB) XML files, so I want to use StAX. My system has two modules: one reads the file (with StAX); the other (the 'parser' module) is supposed to take a single entry of that XML and parse it using DOM. My XML files don't have a fixed structure, so I cannot use JAXB. How can I pass the 'parser' module a specific entry that I want it to parse? For example:

<Items>
   <Item>
        <name> .... </name>
        <price> ... </price>
   </Item>
   <Item>
        <name> .... </name>
        <price> ... </price>
   </Item>
</Items>

I want to use StAX to read that file, but each 'Item' entry should be passed to the 'parser' module.

Edit:
After a little more reading, I think I need a library that reads an XML file as a stream but parses each entry using DOM. Is there such a thing?

Answer 1

You could use a StAX ( javax.xml.stream ) parser and transform ( javax.xml.transform ) each section to a DOM node ( org.w3c.dom ):

import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.dom.DOMResult;
import org.w3c.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // An InputStream lets the parser honour the encoding declared in the XML prolog
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("input.xml"));
        xsr.nextTag(); // Advance to the root (Items) element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            DOMResult result = new DOMResult();
            t.transform(new StAXSource(xsr), result);
            Node domNode = result.getNode(); // hand this node to the 'parser' module
        }
    }

}
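
To make the hand-off concrete, here is a minimal sketch of what the receiving 'parser' module might look like. It is not part of the original answer: the class name ItemParserSketch and the methods childText and describe are hypothetical, and it assumes the Item layout from the question (a name and a price child). It also assumes the node coming out of DOMResult may be a Document wrapping the element, so it unwraps that case first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ItemParserSketch {

    /** Returns the trimmed text of the first child element named childName, or null if there is none. */
    public static String childText(Node item, String childName) {
        NodeList children = item.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE && childName.equals(child.getNodeName())) {
                return child.getTextContent().trim();
            }
        }
        return null;
    }

    /** Hypothetical entry point of the 'parser' module: summarises one Item node. */
    public static String describe(Node item) {
        // Assumption: the caller may pass a Document that wraps the Item element.
        if (item.getNodeType() == Node.DOCUMENT_NODE) {
            item = ((Document) item).getDocumentElement();
        }
        return childText(item, "name") + " costs " + childText(item, "price");
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a node produced by the StAX loop above.
        String xml = "<Item><name> Widget </name><price> 9.99 </price></Item>";
        Node item = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        System.out.println(describe(item));
    }
}
```

In the loop above, the call would then be something like ItemParserSketch.describe(domNode) in place of the unused assignment.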

Also see:

Split 1GB Xml file using Java

Answer 2

Blaise Doughan's answer fails on a clean Java 7 or 8 because of https://bugs.openjdk.java.net/browse/JDK-8016914:

java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl.setXmlVersion(CoreDocumentImpl.java:860)
at com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM.setDocumentInfo(SAX2DOM.java:144)

Funny thing: if you use the JAXB unmarshaller, you don't get the NPE:

package com.common.config;

import java.io.*;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;

import org.w3c.dom.*;

public class Demo {


    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("input.xml"));
        // Advance to root element
        xsr.nextTag(); // TODO: nextTag() can't skip DTD
        xsr.next(); // Advance to first item or EOD

        final JAXBContext jaxbContext = JAXBContext.newInstance();
        final Unmarshaller unm = jaxbContext.createUnmarshaller();
        while(true) {
            // the previous unmarshal() call has already advanced to the next element or whitespace
            if (xsr.getEventType() == XMLStreamReader.START_ELEMENT) {
                JAXBElement<Object> jel = unm.unmarshal(xsr, Object.class);
                Node domNode = (Node)jel.getValue();
                System.err.println(domNode.getNodeName());
            } else if (!xsr.hasNext()) {
                break;
            } else {
                xsr.next();
            }
        }
    }

}

The reason is that com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXConnector$1 does not implement Locator2, so it has no getXMLVersion().
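
If neither the Transformer route nor the JAXB route is usable, another option is to copy each subtree from the StAX events into a DOM tree by hand. This is a different technique from the two answers above, shown only as a rough sketch: the names StaxToDomSketch, forEachElement and copyElement are made up for illustration, and the copy keeps only elements, attributes and text (comments, processing instructions and xmlns declaration attributes are dropped).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class StaxToDomSketch {

    private static String emptyToNull(String s) {
        return (s == null || s.isEmpty()) ? null : s;
    }

    private static String qName(String prefix, String localName) {
        return (prefix == null || prefix.isEmpty()) ? localName : prefix + ":" + localName;
    }

    /** Creates a DOM element (with attributes) for the START_ELEMENT the reader is positioned on. */
    private static Element newElement(XMLStreamReader xsr, Document doc) {
        Element e = doc.createElementNS(emptyToNull(xsr.getNamespaceURI()),
                qName(xsr.getPrefix(), xsr.getLocalName()));
        for (int i = 0; i < xsr.getAttributeCount(); i++) {
            e.setAttributeNS(emptyToNull(xsr.getAttributeNamespace(i)),
                    qName(xsr.getAttributePrefix(i), xsr.getAttributeLocalName(i)),
                    xsr.getAttributeValue(i));
        }
        return e;
    }

    /** Copies the current element and its subtree; leaves the reader on the matching END_ELEMENT. */
    public static Element copyElement(XMLStreamReader xsr, Document doc) throws XMLStreamException {
        Element root = newElement(xsr, doc);
        Element current = root;
        int depth = 1;
        while (depth > 0) {
            switch (xsr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    Element child = newElement(xsr, doc);
                    current.appendChild(child);
                    current = child;
                    depth++;
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    current.appendChild(doc.createTextNode(xsr.getText()));
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    if (depth > 0) {
                        current = (Element) current.getParentNode();
                    }
                    break;
                default:
                    break; // comments, processing instructions etc. are skipped in this sketch
            }
        }
        return root;
    }

    /** Streams the input and passes every element with the given local name to handler, one at a time. */
    public static void forEachElement(Reader in, String localName, Consumer<Element> handler)
            throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(xsr.getLocalName())) {
                    Document doc = db.newDocument(); // one small document per entry
                    doc.appendChild(copyElement(xsr, doc));
                    handler.accept(doc.getDocumentElement());
                }
            }
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Items><Item><name>a</name><price>1</price></Item>"
                + "<Item><name>b</name><price>2</price></Item></Items>";
        forEachElement(new StringReader(xml), "Item",
                item -> System.out.println(item.getElementsByTagName("name").item(0).getTextContent()));
    }
}
```

Because each entry gets its own small Document and is handed to the callback immediately, only one entry needs to be in memory at a time, provided the handler does not keep references to the nodes it receives.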

Answer 3

You can try XMLDog from JLibs.

It evaluates XPath expressions on an XML document using SAX (i.e. without loading the entire XML into memory) and returns DOM nodes for the matching nodes as they are hit.

Thus you can evaluate the XPath /Items/Item on your large XML document. You will be notified as each Item node is parsed; you can process the current Item DOM node and continue.

This makes it suitable for evaluating XPath expressions on large documents.

3 answers

Answer 1: score 18, accepted, 2012-02-21 17:27:15

Answer 2: score 2, 2018-12-19 11:33:04

Answer 3: score 0, 2012-02-21 16:12:10
