Parsing Highly Nested XML without DOM in Java

I've been tasked with fixing a rather irritating heap out-of-memory issue. IBM offers a Cognos SDK that we use with Java; we query all of the packages stored in a content store, which are returned as XML. We then parse that XML and write it to a SQL database. Profiling shows that the largest share of memory is held by char[] arrays, which isn't very helpful on its own (and the heaps are so large they're hard to profile), but it does point towards the DOM parser.

We're talking about 500-1500 XML files (well, technically, XML text streams) that are absurdly deeply nested and vary in size and occasionally in structure. They range from a few KB up to 30 MB, and the program eats upwards of 8 GB of memory after about 300 packages. The programmer before me handled this by calling System.gc manually after every XML parse, which I want to move away from (it doesn't actually solve the issue; it just makes the program viable on the smallest, 500-package server).

I tried JAXB, but the XML has an odd structure that made it very difficult to use here (it has a "folder or querySubject" choice going on). I spent several hours on StAX last week but couldn't quite get it working; the same goes for Woodstox. I couldn't find examples or tutorials on doing this with either. JDOM was what I looked at next (I've read that it handles memory significantly better than pure DOM), but I can't figure out how to get it to parse as deeply as DOM. Current DOM parsing:

            // Parse the XML string into a full in-memory DOM tree.
            is = new ByteArrayInputStream(xml.getBytes("UTF-8"));
            xmlDoc = builder.parse(is);
            is.close();
            String _path, datatype, regularAggregate, description, formula;
            String table, tableLoc;

            // "*" matches every element in the document, at any depth.
            NodeList elements = xmlDoc.getElementsByTagName("*");
            for (int j = 0; j < elements.getLength(); j++) {
                Element element = (Element) elements.item(j);
                String nodeName = element.getNodeName();
                // Compare strings with equals(); == only compares references.
                if (nodeName.equals("queryItem") || nodeName.equals("measure")
                        || nodeName.equals("calculation") || nodeName.equals("filter")) {
                    if (element.hasAttribute("_path")) {
                        _path = element.getAttribute("_path");
                    }
                    // ... and so on for each attribute
                }
            }

My JDOM attempt. Currently it only prints the root element, and I haven't been able to get any deeper than the first layer of children:

SAXBuilder saxBuilder = new SAXBuilder();
Document document = saxBuilder.build(inputFile);

Element root = document.getRootElement();
System.out.println("Root element :" + root.getName());

// getChildren() returns only direct children (never null), so this loop
// sees only <queryItem> elements sitting directly under a top-level <folder>.
List<Element> rList = root.getChildren("folder");
for (Element node : rList) {
    List<Element> elements = node.getChildren("queryItem");
    for (Element a : elements) {
        System.out.println(a.getAttributeValue("_path"));
    }
}

Generated XSD structure of a random package:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="ResponseRoot">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="folder"/>
        <xs:element ref="package"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="package">
    <xs:complexType>
      <xs:attribute name="description" use="required"/>
      <xs:attribute name="name" use="required"/>
      <xs:attribute name="screenTip" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="folder">
    <xs:complexType>
      <xs:sequence>
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element ref="folder"/>
          <xs:element ref="querySubject"/>
        </xs:choice>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="filter"/>
      </xs:sequence>
      <xs:attribute name="_path" use="required"/>
      <xs:attribute name="_ref" use="required"/>
      <xs:attribute name="description" use="required"/>
      <xs:attribute name="isNamespace" use="required" type="xs:integer"/>
      <xs:attribute name="name" use="required"/>
      <xs:attribute name="screenTip" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="querySubject">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="queryItem"/>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="queryItemFolder"/>
      </xs:sequence>
      <xs:attribute name="_path" use="required"/>
      <xs:attribute name="_ref" use="required"/>
      <xs:attribute name="description" use="required"/>
      <xs:attribute name="name" use="required"/>
      <xs:attribute name="screenTip" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="filter">
    <xs:complexType>
      <xs:attribute name="_path" use="required"/>
      <xs:attribute name="_ref" use="required"/>
      <xs:attribute name="description" use="required"/>
      <xs:attribute name="expression" use="required"/>
      <xs:attribute name="name" use="required"/>
      <xs:attribute name="screenTip" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="queryItem">
    <xs:complexType>
      <xs:attribute name="_path" use="required"/>
      <xs:attribute name="_ref" use="required"/>
      <xs:attribute name="currency" use="required"/>
      <xs:attribute name="datatype" use="required" type="xs:NCName"/>
      <xs:attribute name="description" use="required"/>
      <xs:attribute name="displayType" use="required" type="xs:NCName"/>
      <xs:attribute name="expression" use="required"/>
      <xs:attribute name="name" use="required"/>
      <xs:attribute name="promptCascadeOnRef" use="required"/>
      <xs:attribute name="promptDisplayItemRef" use="required"/>
      <xs:attribute name="promptFilterItemRef" use="required"/>
      <xs:attribute name="promptType" use="required" type="xs:NCName"/>
      <xs:attribute name="regularAggregate" use="required" type="xs:NCName"/>
      <xs:attribute name="screenTip" use="required"/>
      <xs:attribute name="unSortable" use="required" type="xs:integer"/>
      <xs:attribute name="usage" use="required" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="queryItemFolder">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="queryItem"/>
        <xs:element ref="queryItemFolder"/>
      </xs:choice>
      <xs:attribute name="_path" use="required"/>
      <xs:attribute name="_ref" use="required"/>
      <xs:attribute name="description" use="required"/>
      <xs:attribute name="name" use="required"/>
      <xs:attribute name="screenTip" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

For nested structures, StAX parsing is easiest to manage if you create one method for each element type.

Example

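import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// The methods below go inside a class of your choice.
public static void main(String[] args) throws Exception {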
    String xml = "<root>" +
                   "<folder name=\"A\">" +
                     "<folder name=\"B\">" +
                       "<book name=\"Learn Java\">" +
                         "<chapter name=\"Hello, World!\"/>" +
                         "<chapter name=\"Variables and Types\"/>" +
                       "</book>" +
                     "</folder>" +
                   "</folder>" +
                 "</root>";
    XMLInputFactory factory = XMLInputFactory.newFactory();
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
    try {
        reader.nextTag(); // Position on root element
        String tagName = reader.getLocalName();
        if (! tagName.equals("root"))
            throw new XMLStreamException("Expected <root> element, found: " + tagName, reader.getLocation());
        parseRoot(reader);
    } finally {
        reader.close();
    }
}

private static void parseRoot(XMLStreamReader reader) throws XMLStreamException {
    while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
        String tagName = reader.getLocalName();
        if (tagName.equals("folder")) {
            parseFolder(reader, Collections.emptyList());
        } else {
            throw new XMLStreamException("Expected <folder> element, found: " + tagName, reader.getLocation());
        }
    }
}

private static void parseFolder(XMLStreamReader reader, List<String> parentPaths) throws XMLStreamException {
    String folderName = reader.getAttributeValue(null, "name");
    if (folderName == null)
        throw new XMLStreamException("Missing 'name' attribute on <folder> element", reader.getLocation());
    List<String> folderPath = new ArrayList<>(parentPaths.size() + 1);
    folderPath.addAll(parentPaths);
    folderPath.add(folderName);
    while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
        String tagName = reader.getLocalName();
        if (tagName.equals("folder")) {
            parseFolder(reader, folderPath);
        } else if (tagName.equals("book")) {
            parseBook(reader, folderPath);
        } else {
            throw new XMLStreamException("Expected <folder> or <book> element, found: " + tagName, reader.getLocation());
        }
    }
}

private static void parseBook(XMLStreamReader reader, List<String> folderPath) throws XMLStreamException {
    String bookName = reader.getAttributeValue(null, "name");
    if (bookName == null)
        throw new XMLStreamException("Missing 'name' attribute on <book> element", reader.getLocation());
    while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
        String tagName = reader.getLocalName();
        if (tagName.equals("chapter")) {
            parseChapter(reader, folderPath, bookName);
        } else {
            throw new XMLStreamException("Expected <chapter> element, found: " + tagName, reader.getLocation());
        }
    }
}

private static void parseChapter(XMLStreamReader reader, List<String> folderPath, String bookName) throws XMLStreamException {
    String chapterName = reader.getAttributeValue(null, "name");
    if (chapterName == null)
        throw new XMLStreamException("Missing 'name' attribute on <chapter> element", reader.getLocation());
    if (! reader.getElementText().isEmpty())
        throw new XMLStreamException("<chapter> element must be empty", reader.getLocation());
    System.out.println("Found:");
    System.out.println("  Folder:  " + folderPath);
    System.out.println("  Book:    " + bookName);
    System.out.println("  Chapter: " + chapterName);
}

Output

Found:
  Folder:  [A, B]
  Book:    Learn Java
  Chapter: Hello, World!
Found:
  Folder:  [A, B]
  Book:    Learn Java
  Chapter: Variables and Types
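
The same pattern can be applied to the package structure from the question. What follows is only a minimal sketch, written against the sample XSD above and not against real Cognos output. The class name PackagePathExtractor and its helper methods are made up for illustration. For brevity the three container elements (folder, querySubject, queryItemFolder) share one method instead of having one each. The names measure and calculation come from the question's DOM code, not from the XSD, so treating them as leaf elements is an assumption. Any element the sketch does not recognise is skipped.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the _path attribute of every leaf element.
class PackagePathExtractor {

    static List<String> extractPaths(Reader xml) throws XMLStreamException {
        List<String> paths = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        try {
            reader.nextTag(); // position on the root element (<ResponseRoot> in the XSD)
            while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
                if (reader.getLocalName().equals("folder")) {
                    parseContainer(reader, paths);
                } else {
                    skipElement(reader); // e.g. <package>
                }
            }
        } finally {
            reader.close();
        }
        return paths;
    }

    // Called with the reader on the START_ELEMENT of a folder, querySubject or
    // queryItemFolder; returns with the reader on the matching END_ELEMENT.
    private static void parseContainer(XMLStreamReader reader, List<String> paths)
            throws XMLStreamException {
        while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
            switch (reader.getLocalName()) {
                case "folder":
                case "querySubject":
                case "queryItemFolder":
                    parseContainer(reader, paths);
                    break;
                case "queryItem":
                case "filter":
                case "measure":      // assumption: taken from the question's DOM code
                case "calculation":  // assumption: taken from the question's DOM code
                    parseLeaf(reader, paths);
                    break;
                default:
                    skipElement(reader); // unknown element: ignore it and its subtree
            }
        }
    }

    private static void parseLeaf(XMLStreamReader reader, List<String> paths)
            throws XMLStreamException {
        String path = reader.getAttributeValue(null, "_path");
        if (path != null) {
            paths.add(path);
        }
        skipElement(reader); // move to the END_ELEMENT of this leaf
    }

    // Consumes events up to the END_ELEMENT that matches the current START_ELEMENT.
    private static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<ResponseRoot><folder _path='/f'>"
                + "<querySubject _path='/f/qs'><queryItem _path='/f/qs/a'/></querySubject>"
                + "</folder><package name='p'/></ResponseRoot>";
        for (String path : extractPaths(new StringReader(xml))) {
            System.out.println(path);
        }
    }
}
```

Because the question already holds each package as a String, extractPaths(new StringReader(xml)) can read it directly without first copying it into a byte array. To keep memory use flat, parseLeaf could write each row to the database as it is found instead of collecting everything in a list.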
