
How do I iterate over nodes in a huge XML in a streaming fashion?

I have a gigantic XML file, like this:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
   [... one gazillion more entries ...]
</catalog>

I want to iterate over this file in a streaming fashion, so that I never have to load the whole thing into memory, something like:

InputStream stream = new FileInputStream("gigantic-book-list.xml");
String nodeName = "book";
Iterator<Document> it = new StreamingXmlIterator(stream, nodeName);
Document bk101 = it.next();
Document bk102 = it.next();

Also, I'd like this to work with different XML input files, without having to create specific objects (e.g. Book.java).

@McDowell has a promising approach that uses XMLStreamReader and StreamFilter at https://stackoverflow.com/a/16799693/13365 , but that only extracts a single node.

Also, Camel's .tokenizeXML does exactly what I want, so I guess I should look into the source code.
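For reference, a Camel route along those lines could look roughly like the sketch below. It is untested, and the file endpoint URI and route names are placeholders for this example, but the splitter's tokenizeXML expression in streaming mode is the part that keeps the whole file out of memory:

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class BookRoute {
  public static void main(String[] args) throws Exception {
    CamelContext context = new DefaultCamelContext();
    context.addRoutes(new RouteBuilder() {
      @Override
      public void configure() {
        // split the file on <book> tokens without reading it all into memory
        from("file:data?fileName=gigantic-book-list.xml&noop=true")
            .split().tokenizeXML("book").streaming()
            .to("log:books");
      }
    });
    context.start();
    Thread.sleep(5000); // keep the context alive long enough to process the file in this sketch
    context.stop();
  }
}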

import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Book {
  // public fields are bound by JAXB's default access, so getters/setters are optional
  public String author;
  public String title;
}

Assuming you want to process the data as strongly typed objects, you can combine StAX and JAXB using utility types:

import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.util.StreamReaderDelegate;

// Passes through only the events that fall inside a <book> element,
// including the closing </book> tag itself.
class ContentFinder implements StreamFilter {
  private boolean capture = false;

  @Override
  public boolean accept(XMLStreamReader xml) {
    if (xml.isStartElement() && "book".equals(xml.getLocalName())) {
      capture = true;
    } else if (xml.isEndElement() && "book".equals(xml.getLocalName())) {
      capture = false;
      return true; // let the closing </book> event through as well
    }
    return capture;
  }
}

// Makes the reader report "no more events" at the end of the current <book>,
// so a single element can be handed off as a complete source.
class Limiter extends StreamReaderDelegate {
  Limiter(XMLStreamReader xml) {
    super(xml);
  }

  @Override
  public boolean hasNext() throws XMLStreamException {
    return !(getParent().isEndElement()
             && "book".equals(getParent().getLocalName()));
  }
}

Usage:

XMLInputFactory inFactory = XMLInputFactory.newFactory();
XMLStreamReader reader = inFactory.createXMLStreamReader(inputStream);
// only events inside <book> elements survive the filter
reader = inFactory.createFilteredReader(reader, new ContentFinder());
Unmarshaller unmar = JAXBContext.newInstance(Book.class)
    .createUnmarshaller();
Transformer tformer = TransformerFactory.newInstance().newTransformer();
while (reader.hasNext()) {
  // the Limiter ends the copy at the close of the current <book> element
  XMLStreamReader limiter = new Limiter(reader);
  Source src = new StAXSource(limiter);
  DOMResult res = new DOMResult();
  tformer.transform(src, res); // copy one <book> subtree into a DOM node
  Book book = (Book) unmar.unmarshal(res.getNode());
  System.out.println(book.title);
}
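If you'd rather have the generic, iterator-style API from the question (no Book class at all), the same ContentFinder/Limiter idea can be wrapped so that each matching element comes back as a detached DOM Document. The sketch below is untested and simply parameterizes the two helper classes above with the element name; the StreamingXmlIterator name comes from the question's wishlist, not from an existing library:

import java.io.InputStream;
import java.util.Iterator;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.util.StreamReaderDelegate;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

// Sketch: yields each occurrence of the named element as a detached DOM Document.
class StreamingXmlIterator implements Iterator<Document> {

  // ContentFinder from above, with the element name as a constructor parameter
  static class ElementFilter implements StreamFilter {
    private final String name;
    private boolean capture;
    ElementFilter(String name) { this.name = name; }
    @Override
    public boolean accept(XMLStreamReader xml) {
      if (xml.isStartElement() && name.equals(xml.getLocalName())) {
        capture = true;
      } else if (xml.isEndElement() && name.equals(xml.getLocalName())) {
        capture = false;
        return true; // keep the closing tag
      }
      return capture;
    }
  }

  // Limiter from above, with the element name as a constructor parameter
  static class ElementLimiter extends StreamReaderDelegate {
    private final String name;
    ElementLimiter(XMLStreamReader xml, String name) { super(xml); this.name = name; }
    @Override
    public boolean hasNext() throws XMLStreamException {
      return !(getParent().isEndElement() && name.equals(getParent().getLocalName()));
    }
  }

  private final XMLStreamReader reader;
  private final String nodeName;
  private final Transformer copier;

  StreamingXmlIterator(InputStream in, String nodeName) throws Exception {
    this.nodeName = nodeName;
    XMLInputFactory factory = XMLInputFactory.newFactory();
    this.reader = factory.createFilteredReader(
        factory.createXMLStreamReader(in), new ElementFilter(nodeName));
    this.copier = TransformerFactory.newInstance().newTransformer();
  }

  @Override
  public boolean hasNext() {
    try {
      return reader.hasNext();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  @Override
  public Document next() {
    try {
      DOMResult result = new DOMResult();
      // copy one element subtree into a fresh DOM, exactly as in the loop above
      copier.transform(new StAXSource(new ElementLimiter(reader, nodeName)), result);
      return (Document) result.getNode();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}

Usage would then be close to what the question asks for, e.g. Iterator<Document> it = new StreamingXmlIterator(stream, "book"); Document bk101 = it.next();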

Isn't this precisely what the SAX API achieves?

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

I think you need to simply track each book startElement() call, and record the incoming elements/attributes from there. Process upon receipt of the corresponding endElement() call. Remember that characters() can be called multiple times across the same text node.
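A minimal handler along those lines might look like the sketch below (untested; the BookHandler name and the println are just placeholders for whatever processing you need). It compares on qName because the default SAXParserFactory is not namespace-aware:

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects author/title text while inside a <book> and processes the book on </book>.
class BookHandler extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();
  private String author, title;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attrs) {
    text.setLength(0); // reset the buffer for the new element
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length); // may be called several times per text node
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("author".equals(qName)) {
      author = text.toString();
    } else if ("title".equals(qName)) {
      title = text.toString();
    } else if ("book".equals(qName)) {
      System.out.println(author + " - " + title); // process one complete book
    }
  }

  public static void main(String[] args) throws Exception {
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new File("gigantic-book-list.xml"), new BookHandler());
  }
}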

Use a SAX parser then. Check the SAX parser tutorial from Oracle.

You need to describe what the desired output of your process is, and what your technology constraints are.

Streaming in XSLT 3.0 is still bleeding edge, but many transformations can be expressed very easily. For example, with Saxon-EE 9.5 you could compute the average price of the books in a streamed transformation as:

<xsl:template name="main">
  <xsl:stream href="books.xml">
    <xsl:value-of select="avg(/books/book/price)"/>
  </xsl:stream>
</xsl:template>
