DOM vs SAX XML parsing for large files

Original 2013-06-26 02:21:35 2 3 java/ javascript/ xml/ parsing/ dom

Background:

I have a large OWL (Web Ontology Language) file (approximately 125MB, or 1.5 million lines long) that I would like to parse into a set of tab-delimited values. I have been researching the SAX and DOM XML parsers, and found the following:

SAX allows for the document to be read node by node, so the whole document is not in memory.
DOM allows for the whole document to be placed in memory at once, but has a ridiculous amount of overhead.

SAX vs DOM for large files:

As far as I understand it,

If I use SAX, I would have to iterate through 1.5 million lines, node by node.
If I use DOM, I would have a big overhead, but then the results would be returned rapidly.
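
To make the "node by node" model concrete, here is a minimal SAX sketch that turns XML into tab-delimited rows. The element names `item` and `label` and the attribute `id` are hypothetical placeholders, not real OWL vocabulary, and the output goes to a `StringBuilder` only to keep the example small.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxToTsv {
    // SAX pushes events at the handler, so state between callbacks lives in fields.
    static class RowHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder text = new StringBuilder();
        private String currentId;   // id of the <item> being read (hypothetical attribute)
        private boolean inLabel;    // true while inside a <label> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("item")) {
                currentId = atts.getValue("id");
            } else if (qName.equals("label")) {
                inLabel = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so accumulate.
            if (inLabel) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("label")) {
                inLabel = false;
                out.append(currentId).append('\t')
                   .append(text.toString().trim()).append('\n');
            }
        }
    }

    static String toTsv(String xml) throws Exception {
        RowHandler handler = new RowHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id=\"a\"><label>Alpha</label></item>"
                   + "<item id=\"b\"><label>Beta</label></item></items>";
        System.out.print(toTsv(xml));
    }
}
```

For a 125MB file you would presumably pass a `File` or `InputStream` to `parse` and write each row to a buffered `Writer` instead of collecting it in memory; only the current row is held at any time.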

Problem:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know any good parsers for JavaScript? I realize many are made for Java, but I am much more comfortable with JavaScript.

3 answers

Meet StAX

Just like SAX, StAX follows a streaming programming model for parsing XML. But it is a cross between DOM's bidirectional read/write support and ease of use, and SAX's CPU and memory efficiency.

SAX is read-only and does push parsing, forcing you to handle events and errors right there and then while parsing the input. StAX, on the other hand, is a pull parser that lets the client call methods on the parser when needed. This also means that the application can read multiple XML files simultaneously.
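
As a rough illustration of the pull model, the sketch below drives two `XMLStreamReader` instances in lockstep from a single loop, which is awkward to do with push callbacks. The class and method names are made up for this example; only the `javax.xml.stream` calls are standard.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxLockstep {
    // Pull events until the next start tag; return its name, or null at end of document.
    static String nextElementName(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getLocalName();
            }
        }
        return null;
    }

    // True when both documents contain the same sequence of element names.
    static boolean sameElementSequence(String xmlA, String xmlB) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
        XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
        try {
            while (true) {
                // The application decides when each parser advances.
                String nameA = nextElementName(a);
                String nameB = nextElementName(b);
                if (nameA == null || nameB == null) {
                    return nameA == null && nameB == null; // both must end together
                }
                if (!nameA.equals(nameB)) {
                    return false;
                }
            }
        } finally {
            a.close();
            b.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(sameElementSequence("<a><b/><c/></a>", "<a><b>text</b><c/></a>"));
    }
}
```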

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.
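
The point that an entire document need not be read can be sketched as follows: the caller stops pulling events as soon as it has what it needs. The helper name `firstText` is hypothetical, and the sketch assumes the target element contains only text, since `getElementText()` throws if it meets a child element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxFirstMatch {
    // Return the text of the first element with the given name, or null if none exists.
    static String firstText(String xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Returning here means no further events are ever requested.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><title>Hello</title><body>never requested</body></root>";
        System.out.println(firstText(xml, "title"));
    }
}
```

With SAX, stopping early typically means throwing an exception out of the handler, which is one reason the pull style can feel more natural for this kind of lookup.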

You want SAX, most likely.

DOM is not necessarily faster; it might well be slower, if it works at all, and, as you say, you would need to hold a LOT in memory, probably needlessly.

OWL XML syntax is reasonably flat, but contains lots of cross-references.

If you need to resolve the cross-references, then a streaming approach (like SAX or StAX) isn't feasible; you will need to build a data structure in memory that holds the whole tree. If you're going to use an in-memory tree, don't use DOM, use one of the more modern models such as JDOM2 or XOM - they are more efficient and more usable.

If a streaming approach is feasible - that is, if there's a very direct correspondence between your input and output, then StAX is easier to work with than SAX because you can save the current state in variables on the Java stack, rather than needing complex data structures to maintain state between calls.
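
A minimal sketch of what "state in variables on the Java stack" can look like with StAX: each record is read by an ordinary method whose local variables hold the partial row. The element names `item` and `label` and the attribute `id` are hypothetical, and nested `item` elements are not handled.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxToTsv {
    // Called with the reader positioned on an <item> start tag; returns "id<TAB>label".
    static String readItem(XMLStreamReader reader) throws XMLStreamException {
        String id = reader.getAttributeValue(null, "id"); // local variable, not a field
        String label = "";
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("label")) {
                label = reader.getElementText().trim();
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("item")) {
                break; // finished this record
            }
        }
        return id + "\t" + label;
    }

    static String toTsv(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("item")) {
                    out.append(readItem(reader)).append('\n');
                }
            }
        } finally {
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<items><item id=\"a\"><label>Alpha</label></item>"
                   + "<item id=\"b\"><label>Beta</label></item></items>";
        System.out.print(toTsv(xml));
    }
}
```

The structure of the code mirrors the structure of the document: one method per kind of element, with no flags or buffers shared between callbacks.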

However, there's an alternative; you could write the whole thing in streaming XSLT 3.0. To be honest, this is bleeding edge and your learning time would probably be a lot greater; and it's not open-source; but you might well end up with a solution in 10 lines of code rather than 300.

There are other streaming technologies I haven't tried, like XStream.



solution1
5 ACCEPTED 2013-06-26 03:20:13

solution2
2 2013-06-26 02:25:50

solution3
2 2013-06-26 08:09:59
