
DOM vs SAX XML parsing for large files

Background:

I have a large OWL (Web Ontology Language) file (approximately 125MB, or 1.5 million lines) that I would like to parse into a set of tab-delimited values. I have been researching the SAX and DOM XML parsers, and found the following:

  • SAX allows for the document to be read node by node, so the whole document is not in memory.
  • DOM allows for the whole document to be placed in memory at once, but has a ridiculous amount of overhead.

SAX vs DOM for large files:

As far as I understand it,

  • If I use SAX, I would have to iterate through 1.5 million lines, node by node.
  • If I use DOM, I would have a big overhead, but then the results would be returned rapidly.

Problem:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know any good parsers for JavaScript? I realize many are made for Java, but I am much more comfortable with JavaScript.

Meet StAX

Just like SAX, StAX follows a streaming programming model for parsing XML. But it's a cross between the two: it has DOM's support for both reading and writing and its ease of use, together with SAX's CPU and memory efficiency.

SAX is read-only and does push parsing, forcing you to handle events and errors right there and then while it parses the input. StAX, on the other hand, is a pull parser: the client calls methods on the parser when it wants the next event. Because the application decides when each parser advances, it can also read multiple XML files simultaneously.
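To make the push/pull difference concrete, here is a minimal sketch that counts elements of a given name both ways, using only the JAXP classes bundled with the JDK. The class and method names (PushVsPull, countWithSax, countWithStax) are made up for illustration, and the input is assumed to use unprefixed element names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class PushVsPull {

    // SAX (push): the parser drives and calls back into the handler,
    // so the running count has to be kept outside the callback.
    static int countWithSax(String xml, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The default SAX factory is not namespace-aware, so compare qName.
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    // StAX (pull): the application drives the loop and asks for the next
    // event, so the count is an ordinary local variable.
    static int countWithStax(String xml, String name) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(name)) {
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

Both methods touch every element once and keep nothing but a counter in memory; the only difference is who owns the loop.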

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.
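As a rough sketch of what "portions can be skipped" looks like with the cursor API: the application simply fast-forwards past a subtree without looking at its contents. The element names annotations and label are hypothetical, as are the class and method names. Note that the parser still has to scan the skipped bytes to find the matching end tag; what is saved is the application-level work on them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class StaxSkip {

    // Advance past the element the reader is positioned on (a START_ELEMENT),
    // including all of its descendants.
    static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Collect the text of every <label>, ignoring anything inside <annotations>.
    static List<String> labelsOutsideAnnotations(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        List<String> labels = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String name = reader.getLocalName();
            if (name.equals("annotations")) {
                skipElement(reader);                  // never inspect this subtree
            } else if (name.equals("label")) {
                labels.add(reader.getElementText());  // assumes text-only content
            }
        }
        reader.close();
        return labels;
    }
}
```

The same loop can also just return as soon as it has what it needs, which is the other way an application avoids processing the whole document.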

You want SAX, most likely.

DOM is not necessarily faster; it might well be slower, if it works at all, and, as you say, you would need to hold a LOT in memory, probably needlessly.

OWL XML syntax is reasonably flat, but contains lots of cross-references.

If you need to resolve the cross-references, then a streaming approach (like SAX or StAX) isn't feasible; you will need to build a data structure in memory that holds the whole tree. If you're going to use an in-memory tree, don't use DOM, use one of the more modern models such as JDOM2 or XOM - they are more efficient and more usable.

If a streaming approach is feasible - that is, if there's a very direct correspondence between your input and output, then StAX is easier to work with than SAX because you can save the current state in variables on the Java stack, rather than needing complex data structures to maintain state between calls.
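For the "direct correspondence" case, a StAX loop that writes one tab-delimited row per labelled class might look roughly like this. It is only a sketch under assumptions: the input is taken to be RDF/XML-style OWL with owl:Class elements carrying an rdf:about attribute and an rdfs:label child, Class elements are assumed not to nest, and OwlToTsv and classesToTsv are made-up names. Real OWL files vary, so the element and attribute names would need adjusting.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class OwlToTsv {

    // Streams the input and writes "<class IRI>\t<label>\n" for each label
    // found inside a Class element. Assumes Class elements are not nested.
    static void classesToTsv(Reader in, Writer out)
            throws XMLStreamException, IOException {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
        String currentId = null;  // the only state needed, kept in a local variable
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("Class")) {
                    // A null namespace means "match the local name in any namespace".
                    currentId = reader.getAttributeValue(null, "about");
                } else if (name.equals("label") && currentId != null) {
                    out.write(currentId);
                    out.write('\t');
                    out.write(reader.getElementText());
                    out.write('\n');
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("Class")) {
                currentId = null;  // left the class; later labels belong elsewhere
            }
        }
        reader.close();
        out.flush();
    }
}
```

For a 125MB file you would pass a buffered file reader and writer, so memory use stays flat regardless of input size. The equivalent SAX handler would need currentId as a field that survives between callbacks.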

However, there's an alternative; you could write the whole thing in streaming XSLT 3.0. To be honest, this is bleeding edge and your learning time would probably be a lot greater; and it's not open-source; but you might well end up with a solution in 10 lines of code rather than 300.

There are other streaming technologies I haven't tried, like XStream.
