简体   繁体   中英

Java Sax to parse complex large XML file

I am using SAX to parse some large XML files and I want to ask the following: The XML files have a complex structure. Something like the following:

<library>
    <books>
    <book>
        <title></title>
    <img>
        <name></name>
        <url></url>
    </img>
    ...
    ...
    </book>
    ...
    ...
</books>
<categories>
    <category id="abcd">
        <locations>
        <location>...</location>
    </locations>
    <url>...</url>
    </category>
    ...
    ... 
</categories>
<name>...</name>
<url>...</url>
</library>

The fact is that these files are over 50MB each and a lot of tags are repeated under different context, eg url under /books/book/img but also under /library and under /library/categories/category and so on.

My SAX parser uses a subclass of DefaultHandler in which I override teh startElement and the endElement methods (among others). But the problem is that these methods are huge in terms of lines of code due to the business logic of these XML files. I am using a lot of

if ("url".equalsIgnoreCase(qName)) {
    // peek at stack and if book is on top
    // ...
    // else if category is on top
    // ...
} else if (....) {
}

I was wondering whether there is a more proper / correct / elegant way to perform the xml parsing.

Thank you all

What you can do is implement separate ContentHandler for different contexts. For example write one for <books> , one for <categories> and one top-level one.

Then, as soon as the books startElement method is called, you immediately switch the ContentHandler using XMLReader.setContentHandler() . Then the <books> specific ContentHandler switches back to the top-level handler to when its endElement method is called for books .

This way each ContentHandler can focus on his particular part of the XML and need not know about all the other parts.

The only ugly-ish part is that the specific handlers need to know of the top-level handler and when to switch back to it, which can be worked around by providing a simple "handler stack" that handles that for you.

Not sure whether you're asking 1) is there something else you can do besides checking the tag against a bunch of strings or 2) if there's an alternative to a long if-then-else kind of statement.

The answer to 1 is not that I've found. Someone else may tackle that one.

The answer to 2 depends on your domain. One way I see is that if the point of this is to hydrate a bunch of objects from an XML file, then you can use a factory method.

So the first factory method has the long if then else statement that simply passes off the XML to the appropriate classes. Then each of your classes has a method like constructYourselfFromXmlString. This will improve your design because only the objects themselves know about the private data that is in an XML to hydrate them.

the reason this is hard is that, if you think about it, exporting an Object to XML and importing back in really violates encapsulation. Nothing to be done about it, just is. This at least makes things a little more encapsulated.

HTH

Agreeing with the sentiment that exporting an object to XML is a violation of encapsulation, the actual technique used to handle tags which are nested at different lengths isn't terribly difficult using SAX.

Basically, keep a StringBuffer which will maintain your "location" in the document, which will be a directory like representation of the nested tag you are currently within. For example, if at the moment the string buffer's contents are /library/book/img/url then you know it's an URL for an image in a book, and not a URL for some category.

Once you ensure that your "path tracking" algorithms are correct you can then wrap your object creation routines with better handling by using string matches. Instead of

if ("url".equalsIgnoreCase(qName)) {
   ...
}

you can now substitute

if (location.equalsIgnoreCase("/library/book/img/url")) {
   ...
}

If for some reason this doesn't appeal to you, there are still other solutions. For example, you can make a SAX handler which implements a stack of Handlers where the top handler is responsible for handling just it's portion of the XML document, and it pops itself off the stack once it is done. Using such a scheme, the each object gets created by its own unique individual handler, and some handlers basically check and direct which "object creation" handlers get shoved onto the handling stack at the appropriate times.

I've used both techniques. There are strengths in both, and which one is best really depends on the input and the needed objects.

You could refactor your SAX content handling so that you register a set of rules, each of which has a test that it applies to see if it matches the element, and an action that is executed if it does. This is moving closer to the XSLT processing model, while still doing streamed processing. Or you could move to XSLT - processing 50Mb input files is well within the capabilities of a modern XSLT processor.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM