
Editing a big XML via DOM parser

Original post: 2011-09-25 16:48:27 | Tags: java, xml, parsing, memory

A very big XML document is being parsed with a DOM parser. There is now a requirement to add and delete elements, i.e. to edit the XML. How can the XML be edited when the entire document cannot be loaded because of memory constraints? What strategy could solve this?

5 answers

You may consider using a SAX parser instead, which doesn't keep the whole document in memory. It will be faster and will also use much less memory.
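To make the SAX suggestion concrete, here is a minimal sketch using the JDK's built-in SAX parser. The class name `SaxCount`, the method `countElements`, and the sample document are made up for illustration; the point is only that the handler sees one event at a time and no tree is ever built.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of any library.
class SaxCount {
    /** Counts elements with the given name without building a tree. */
    public static int countElements(InputSource source, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace aware, so compare qName.
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Sample data invented for this sketch.
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/><note/></orders>";
        InputSource source = new InputSource(new StringReader(xml));
        System.out.println(countElements(source, "order")); // prints 2
    }
}
```

For a real file, an `InputSource` wrapping a `FileInputStream` would take the place of the string, and memory use should stay roughly constant regardless of document size.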

As two other answers have already mentioned, a SAX parser will do the trick. Your other alternative to DOM is a StAX parser.

Traditionally, XML APIs are either:

DOM based - the entire document is read into memory as a tree structure for random access by the calling application

event based - the application registers to receive events as entities are encountered within the source document.

Both have advantages; the former (for example, DOM) allows for random access to the document, the latter (e.g. SAX) requires a small memory footprint and is typically much faster.

These two access metaphors can be thought of as polar opposites. A tree based API allows unlimited, random access and manipulation, while an event based API is a 'one shot' pass through the source document.

StAX was designed as a median between these two opposites. In the StAX metaphor, the programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward, 'pulling' the information from the parser as it needs. This is different from an event based API such as SAX, which 'pushes' data to the application, requiring the application to maintain state between events as necessary to keep track of location within the document.
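Applied to the original question, the pull model suggests one possible editing strategy: stream the document from input to output, dropping or inserting events on the way, so that only the current event is held in memory. The sketch below uses the JDK's `javax.xml.stream` event API. The class `StaxEdit`, the method `edit`, and the element names `obsolete` and `added` are assumptions made up for this illustration, and the "append as last child of the root" rule is just one example of where new data might go.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class, not part of any library.
class StaxEdit {
    /**
     * Copies XML from in to out, dropping every dropName subtree and
     * appending an empty addName element as the last child of the root.
     */
    public static void edit(Reader in, Writer out, String dropName, String addName)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        int depth = 0;      // nesting depth of the element currently open
        int skipDepth = -1; // depth of the subtree being dropped, -1 if none
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                depth++;
                String name = e.asStartElement().getName().getLocalPart();
                if (skipDepth < 0 && name.equals(dropName)) {
                    skipDepth = depth; // start dropping at this element
                }
            }
            // Insert the new element just before the root's end tag.
            if (e.isEndElement() && depth == 1 && skipDepth < 0) {
                writer.add(events.createStartElement("", "", addName));
                writer.add(events.createEndElement("", "", addName));
            }
            if (skipDepth < 0) {
                writer.add(e); // pass everything else through unchanged
            }
            if (e.isEndElement()) {
                if (depth == skipDepth) {
                    skipDepth = -1; // the dropped subtree has ended
                }
                depth--;
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample data invented for this sketch.
        String xml = "<root><keep>1</keep><obsolete><x/></obsolete></root>";
        StringWriter out = new StringWriter();
        edit(new StringReader(xml), out, "obsolete", "added");
        System.out.println(out);
    }
}
```

With a big document, the reader and writer would wrap file streams, writing to a temporary file that replaces the original once the copy succeeds. The trade-off is that each edit pass rewrites the whole file, and an edit can only depend on what has already been read.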

StAX is my preferred approach for handling large documents. If DOM is a requirement, check out DOM implementations like Xerces that support lazy construction of DOM nodes:

http://xerces.apache.org/xerces-j/faq-write.html#faq-4
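As a rough sketch of what lazy node construction looks like in practice, Xerces exposes a `defer-node-expansion` feature that can be requested through the standard JAXP factory. This assumes the underlying parser is Xerces or the Xerces-derived parser bundled with the JDK; other parsers may reject the feature, which the code below tolerates. The class `DeferredDom` and the method `lazyFactory` are names invented for this example. Note also that a deferred DOM still keeps the whole document in a compact internal form, so it is likely to reduce memory use rather than remove the limit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any library.
class DeferredDom {
    // Xerces feature id for lazy (deferred) DOM node expansion.
    static final String DEFER_NODE_EXPANSION =
            "http://apache.org/xml/features/dom/defer-node-expansion";

    /** Returns a factory that asks for deferred node expansion if supported. */
    public static DocumentBuilderFactory lazyFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setFeature(DEFER_NODE_EXPANSION, true);
        } catch (ParserConfigurationException e) {
            // Assumption: a non-Xerces parser is in use; fall back to its defaults.
        }
        return factory;
    }

    public static void main(String[] args) throws Exception {
        // Sample data invented for this sketch.
        InputSource source = new InputSource(new StringReader("<a><b/></a>"));
        Document doc = lazyFactory().newDocumentBuilder().parse(source);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints a
    }
}
```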

Your assumption that memory constraints prevent loading the XML document may only apply to DOM. VTD-XML loads the entire XML into memory and does so efficiently, in both memory (about 1.3x the size of the XML document) and performance.

http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf

Another distinct benefit, which no other existing XML framework has, is its incremental update capability:

http://www.devx.com/xml/Article/36379

As stivlo mentioned, you can use a SAX parser for reading the XML.

But for writing the XML, you can write to a file output stream as plain text. I am sure you will get a requirement that states after which tag or under which tag the new data should be inserted.