简体   繁体   English

复杂/大型XML的Java Stax

[英]Java Stax for Complex / Large XML

I have an XML file that is 4.2 GB! 我有一个4.2 GB的XML文件! Obviously parsing the entire DOM is not practical. 显然,解析整个DOM是不切实际的。 I have been looking at SAX and STAX to accomplish parsing this gigantic XML file. 我一直在寻找SAX和STAX来完成对这个巨大的XML文件的解析。 However all the examples I've seen are simple. 但是,我看到的所有示例都很简单。 The XML file I am dealing with has nested on nested on nested. 我正在处理的XML文件嵌套在嵌套上。 There are areas where it goes 10+ levels. 在某些地区,它会达到10多个级别。

I found this tutorial but not sure if its a viable solution. 我找到了本教程,但不确定它是否可行。

http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html (botton example using STAX) http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html (使用STAX的波顿示例)

I'm not really sure how to handle nested objects. 我不太确定如何处理嵌套对象。

I have created Java objects to mimic the structure of the XML. 我创建了Java对象来模仿XML的结构。 Here are a few, too many to display. 这里有太多,无法显示。

Record.java Record.java

public class Record implements Serializable {

    String uid;
    StaticData staticData;
    DynamicData dynamicData;
}

Summary.java 摘要.java

public class Summary {

    EWUID ewuid;
    PubInfo pubInfo;
    Titles titles;
    Names names;
    DocTypes docTypes;
    Publishers publishers;
}

EWUID.java EWUID.java

public class EWUID {

    String collId;
    String edition;
}

PubInfo.java PubInfo.java

public class PubInfo {

    String coverDate;
    String hasAbstract;
    String issue;
    String pubMonth;
    String pubType;
    String pubYear;
    String sortDate;
    String volume;
}

This is the code I've come up with so far. 这是我到目前为止提出的代码。

public class TRWOSParser {

    XMLEventReader eventReader;
    XMLInputFactory inputFactory;
    InputStream inputStream;

    public TRWOSParser(String file) throws FileNotFoundException, XMLStreamException {
        inputFactory = XMLInputFactory.newInstance();
        inputStream = new FileInputStream(file);
        eventReader = inputFactory.createXMLEventReader(inputStream);
    }

    public void parse() throws XMLStreamException{

        while (eventReader.hasNext()) {
            XMLEvent event = eventReader.nextEvent();

            if (event.isStartElement()) {
                StartElement startElement = event.asStartElement();
                if (startElement.getName().getLocalPart().equals("record")) {
                    Record record = new Record();
                    Iterator<Attribute> attributes = startElement.getAttributes();
                    while (attributes.hasNext()) {
                        Attribute attribute = attributes.next();
                        if (attribute.getName().toString().equals("UID")) {
                            System.out.println("UID: " + attribute.getValue());
                        }
                    }
                }
            }
        }
    }
}

Update: 更新:

The data in the XML is licensed so I cannot show the full file. XML中的数据已获得许可,因此我无法显示完整文件。 This is a very very small segment in which I have scrambled the data. 这是一个非常小的片段,在其中我已对数据进行了加密。

<?xml version="1.0" encoding="UTF-8"?>
<records>
    <REC>
        <UID>WOS:000310438600004</UID>
        <static_data>
            <summary>
                <EWUID>
                    <WUID coll_id="WOS" />
                    <edition value="WOS.SCI" />
                </EWUID>
                <pub_info coverdate="NOV 2012" has_abstract="N" issue="5" pubmonth="NOV" pubtype="Journal" pubyear="2012" sortdate="2012-11-01" vol="188">
                    <page begin="1662" end="1663" page_count="2">1662-1663</page>
                </pub_info>
                <titles count="6">
                    <title type="source">JOURNAL OF UROLOGY</title>
                    <title type="source_abbrev">J UROLOGY</title>
                    <title type="abbrev_iso">J. Urol.</title>
                    <title type="abbrev_11">J UROL</title>
                    <title type="abbrev_29">J UROL</title>
                    <title type="item">Something something</title>
                </titles>
                <names count="1">
                    <name addr_no="1 2 3" reprint="Y" role="author" seq_no="1">
                        <display_name>John Doe</display_name>
                        <full_name>John Doe</full_name>
                        <wos_standard>Doe, John</wos_standard>
                        <first_name>John</first_name>
                        <last_name>Doe</last_name>
                    </name>
                </names>
                <doctypes count="1">
                    <doctype>Editorial Material</doctype>
                </doctypes>
                <publishers>
                    <publisher>
                        <address_spec addr_no="1">
                            <full_address>360 PARK AVE SOUTH, NEW YORK, NY 10010-1710 USA</full_address>
                            <city>NEW YORK</city>
                        </address_spec>
                        <names count="1">
                            <name addr_no="1" role="publisher" seq_no="1">
                                <display_name>ELSEVIER SCIENCE INC</display_name>
                                <full_name>ELSEVIER SCIENCE INC</full_name>
                            </name>
                        </names>
                    </publisher>
                </publishers>
            </summary>
        </static_data>
    </REC>
</records>

A similar solution to lscoughlin's answer is to use DOM4J which has mechanims to deal with this scenario: http://dom4j.sourceforge.net/ 与lscoughlin答案类似的解决方案是使用DOM4J,该机制具有处理这种情况的机制: http ://dom4j.sourceforge.net/

In my opionin it is more straight forward and easier to follow. 在我看来,它更直接,更容易理解。 It might not support namespaces, though. 但是,它可能不支持名称空间。

I'm making two assumptions 1) that there is an early level of repetition, and 2) that you can do something meaningful with a partial document. 我有两个假设:1)处于早期的重复水平; 2)您可以对部分文档进行有意义的操作。

Let's assume you can move some level of nesting in, and then handle the document multiple times, removing the nodes at the working level each time you "handle" the document. 假设您可以移入某个嵌套级别,然后多次处理文档,每次“处理”文档时都在工作级别上删除节点。 This means that only a single working subtree will be in memory at any given time. 这意味着在任何给定时间只有一个工作的子树将在内存中。

Here's a working code snippet: 这是一个工作代码段:

package bigparse;

import static javax.xml.stream.XMLStreamConstants.CHARACTERS;
import static javax.xml.stream.XMLStreamConstants.END_DOCUMENT;
import static javax.xml.stream.XMLStreamConstants.END_ELEMENT;
import static javax.xml.stream.XMLStreamConstants.START_DOCUMENT;
import static javax.xml.stream.XMLStreamConstants.START_ELEMENT;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BigParse {

    public static void main(String... args) {

        XMLInputFactory factory = XMLInputFactory.newInstance();
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();

        try {

            XMLStreamReader streamReader = factory.createXMLStreamReader(new FileReader("src/main/resources/test.xml"));
            DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

            Document document = documentBuilder.newDocument();
            Element rootElement = null;
            Element currentElement = null;

            int branchLevel = 0;
            int maxBranchLevel = 1;

            while (streamReader.hasNext()) {
                int event = streamReader.next();
                switch (event) {
                case START_DOCUMENT:
                    continue;
                case START_ELEMENT:

                    if (branchLevel < maxBranchLevel) {
                        Element workingElement = readElementOnly(streamReader, document);

                        if (rootElement == null) {
                            document.appendChild(workingElement);
                            rootElement = document.getDocumentElement();
                            currentElement = rootElement;
                        } else {
                            currentElement.appendChild(workingElement);
                            currentElement = workingElement;
                        }

                        branchLevel++;
                    } else {

                        workingLoop(streamReader, document, currentElement);

                    }

                    continue;
                case CHARACTERS:
                    currentElement.setTextContent(streamReader.getText());
                    continue;

                case END_ELEMENT:
                    if (currentElement != rootElement) {
                        currentElement = (Element) currentElement.getParentNode();
                        branchLevel--;
                    }

                    continue;

                case END_DOCUMENT:
                    break;
                }

            }

        } catch (ParserConfigurationException
                | FileNotFoundException
                | XMLStreamException e) {
            throw new RuntimeException(e);
        }

    }

    private static Element readElementOnly(XMLStreamReader streamReader, Document document) {

        Element workingElement = document.createElement(streamReader.getLocalName());

        for (int attributeIndex = 0; attributeIndex < streamReader.getAttributeCount(); attributeIndex++) {
            workingElement.setAttribute(
                    streamReader.getAttributeLocalName(attributeIndex),
                    streamReader.getAttributeValue(attributeIndex));

        }
        return workingElement;
    }

    private static void workingLoop(final XMLStreamReader streamReader, final Document document, final Element fragmentRoot)
            throws XMLStreamException {

        Element startElement = readElementOnly(streamReader, document);
        fragmentRoot.appendChild(startElement);

        Element currentElement = startElement;

        while (streamReader.hasNext()) {
            int event = streamReader.next();
            switch (event) {
            case START_DOCUMENT:
                continue;
            case START_ELEMENT:

                Element workingElement = readElementOnly(streamReader, document);
                currentElement.appendChild(workingElement);
                currentElement = workingElement;

                continue;
            case CHARACTERS:
                currentElement.setTextContent(streamReader.getText());
                continue;

            case END_ELEMENT:
                if (currentElement != startElement) {
                    currentElement = (Element) currentElement.getParentNode();
                    continue;

                } else {

                    handleDocument(document, startElement);

                    fragmentRoot.removeChild(startElement);
                    startElement = null;
                    return;
                }

            }

        }

    }

    // THIS FUNCTION DOES SOMETHING MEANINFUL
    private static void handleDocument(Document document, Element startElement) {

        System.out.println(stringify(document));
    }

    private static String stringify(Document document) {

        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");

            StreamResult result = new StreamResult(new StringWriter());
            DOMSource source = new DOMSource(document);

            transformer.transform(source, result);

            String xmlString = result.getWriter().toString();
            return xmlString;

        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

}

EDIT: I made an incredibly silly mistake. 编辑:我犯了一个非常愚蠢的错误。 It's fixed now. 现在已修复。 It's working but imperfect -- should be enough to lead you in a useful direction. 它正在运行,但并不完美-应该足以引导您朝着有用的方向发展。

Consider using an XSLT 3.0 streaming transformation of the form: 考虑使用以下形式的XSLT 3.0流转换:

<xsl:template name="main">
  <xsl:stream href="bigInput.xml">
    <xsl:for-each select="copy-of(/records/REC)">
      <!-- process one record -->
    </xsl:for-each>
  </xsl:stream>
</xsl:template>

You can process this using Saxon-EE 9.6. 您可以使用Saxon-EE 9.6进行处理。

The "process one record" logic could use the Saxon SQL extension, or it could invoke an extension function: the context node will be a REC element with its contained tree, fully navigable within the subtree, but with no ability to navigate outside the REC element currently being processed. “处理一条记录”逻辑可以使用Saxon SQL扩展,也可以调用扩展功能:上下文节点将是一个REC元素,其包含的树可在子树中完全导航,但无法在REC之外导航当前正在处理的元素。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM