
Java: how to process large XML files using the Saxon library

In an app I am working on, I have to process very large XML files (up to 2 GB in size). I want to run some XQuery queries against those files using the Saxon Java library.

How can I do this so that only a small set of records is held in memory at any time, with the file processed in such small batches rather than all at once, while the XQuery output is still correct? I would prefer to run the queries on machines with only 0.5 GB of RAM, so loading the entire XML file into memory at once is simply not possible.

Saxon's support for streamed processing is actually stronger in XSLT than in XQuery, largely because the XSLT working group has been addressing this issue in designing XSLT 3.0. You can find information on the streaming capabilities of the product at

http://www.saxonica.com/documentation9.4-demo/index.html#!sourcedocs/streaming

Note these are available only in the commercial edition, Saxon-EE.

For simple "burst mode" streaming you can do things like:

for $e in saxon:stream(doc('big.xml')/*/record[@field='234']) return $e/name

By "burst mode" I essentially mean a query that operates over a large number of small disjoint subtrees of the source document.
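To make the "small disjoint subtrees" access pattern concrete, here is a minimal plain-Java sketch using StAX from the JDK. It is not Saxon's streaming implementation and does not need Saxon-EE; it only mimics what the query above does, visiting one record at a time and never holding more than the current record's state. The class and method names (BurstScan, namesOfMatchingRecords) are made up for illustration, and it assumes that each name element contains only text. Unlike the query, it returns the text of each name rather than the element itself.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: a hand-written equivalent of /*/record[@field=...]/name.
final class BurstScan {

    /** Collects the text of every name child of each matching record under the root. */
    static List<String> namesOfMatchingRecords(Reader xml, String wantedField)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);

        List<String> names = new ArrayList<>();
        int depth = 0;                    // 1 = root element, 2 = record, 3 = record child
        boolean inMatchingRecord = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    String tag = reader.getLocalName();
                    if (depth == 2 && tag.equals("record")) {
                        // Decide per record whether its subtree is of interest.
                        inMatchingRecord =
                                wantedField.equals(reader.getAttributeValue(null, "field"));
                    } else if (depth == 3 && inMatchingRecord && tag.equals("name")) {
                        // getElementText() reads up to and including the matching end tag,
                        // so the depth is adjusted here instead of in the END_ELEMENT branch.
                        names.add(reader.getElementText());
                        depth--;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == 2) {
                        inMatchingRecord = false; // leaving a record subtree
                    }
                    depth--;
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```

For a real 2 GB file you would pass a buffered file reader and handle each name as it is found instead of collecting everything into a list.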

The best (but complicated) way to get this functionality is to restrict the set of supported XQuery queries (i.e. enumerate all the use cases). Then, once per file, process it with SAX or StAX to build an internal "index" for the whole XML file that maps search keys to offsets (start and end) in the file. Those offsets should delimit a small but well-formed part of the XML file that can be loaded on its own and checked against the specified XQuery.
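The lookup half of that idea might look roughly like the sketch below. It assumes the index already exists (for example as a map from search key to a start/end pair built by an earlier SAX or StAX pass, which is not shown), that the offsets are byte offsets into a UTF-8 file, and that the fragment does not rely on namespace prefixes or entities declared outside it. FragmentLoader and loadFragment are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical helper: loads one indexed fragment without reading the rest of the file.
final class FragmentLoader {

    /** Reads bytes [start, end) of the file and parses them as a standalone XML document. */
    static Element loadFragment(Path file, long start, long end)
            throws IOException, SAXException, ParserConfigurationException {
        // Only the fragment is held in memory; it must fit in a single byte array.
        byte[] slice = new byte[Math.toIntExact(end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            raf.readFully(slice);
        }
        // The slice has no XML declaration, so the parser falls back to UTF-8.
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(slice))
                .getDocumentElement();
    }
}
```

The returned element is small enough to hand to an in-memory query engine, so the query is evaluated against one fragment at a time.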

An alternative is to parse the XML file (again with SAX or StAX) into a disk-based temporary database (such as Apache Derby) and write your own XQuery-to-SQL translator or interpreter to access the data. You won't get an OutOfMemoryError, but the performance of this approach may not be the best for files that are used only once.

