
How to improve StAX xml parser speed in java?

I have an XML parser using StAX and I am using it to parse a huge file. I want to bring the time down as low as possible. I am reading the values, putting them into an array, and sending them off to another function for evaluation. I am targeting the displayName tag, and the parser should move on to the next XML document as soon as it grabs the name instead of reading the whole file. I am looking for the fastest approach.

Java:


import java.io.File;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Iterator;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.*;

public class Driver {

    private static boolean bname;

    public static void main(String[] args) throws FileNotFoundException, XMLStreamException {

        File file = new File("C:\\Users\\Robert\\Desktop\\root\\SDKCode\\src\\main\\java\\com\\example\\xmlClass\\data.xml");


        parser(file);
    }

    public static void parser(File file) throws FileNotFoundException, XMLStreamException {

        bname = false;


        XMLInputFactory factory = XMLInputFactory.newInstance();


        XMLEventReader eventReader = factory.createXMLEventReader(new FileReader(file));


        while (eventReader.hasNext()) {

            XMLEvent event = eventReader.nextEvent();

            // This will trigger when the tag is of type <...>
            if (event.isStartElement()) {
                StartElement element = (StartElement) event;


                Iterator<Attribute> iterator = element.getAttributes();
                while (iterator.hasNext()) {
                    Attribute attribute = iterator.next();
                    QName name = attribute.getName();
                    String value = attribute.getValue();
                    System.out.println(name + " = " + value);
                }


                if (element.getName().toString().equalsIgnoreCase("displayName")) {
                    bname = true;
                }

            }


            if (event.isEndElement()) {
                EndElement element = (EndElement) event;


                if (element.getName().toString().equalsIgnoreCase("displayName")) {
                    bname = false;
                }


            }


            if (event.isCharacters()) {
                // Depending upon the tag opened the data is retrieved .
                Characters element = (Characters) event;

                if (bname) {
                    System.out.println(element.getData());
                }

            }
        }
    }
}

XML:

<?xml version="1.0" encoding="UTF-8"?>
<results
        xmlns="urn:www-collation-com:1.0"
        xmlns:coll="urn:www-collation-com:1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:www-collation-com:1.0
              urn:www-collation-com:1.0/results.xsd">

    <WebServiceImpl array="1"
        guid="FFVVRJ5618KJRHNFUIRV845NRUVHR" xsi:type="coll:com.model.topology.app.web.WebService">
        <isPlaceholder>false</isPlaceholder>
        <displayName>server.servername1.siqom.siqom.us.com</displayName>
        <hierarchyType>WebService</hierarchyType>
        <hierarchyDomain>app.web</hierarchyDomain>
    </WebServiceImpl>
</results>

<?xml version="1.0" encoding="UTF-8"?>
<results
        xmlns="urn:www-collation-com:1.0"
        xmlns:coll="urn:www-collation-com:1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:www-collation-com:1.0
              urn:www-collation-com:1.0/results.xsd">

    <WebServiceImpl array="1"
        guid="FFVVRJ5618KJRHNFUIRV845NRUVHR" xsi:type="coll:com.model.topology.app.web.WebService">
        <isPlaceholder>false</isPlaceholder>
        <displayName>server.servername2.siqom.siqom.us.com</displayName>
        <hierarchyType>WebService</hierarchyType>
        <hierarchyDomain>app.web</hierarchyDomain>
    </WebServiceImpl>
</results>

<?xml version="1.0" encoding="UTF-8"?>
<results
        xmlns="urn:www-collation-com:1.0"
        xmlns:coll="urn:www-collation-com:1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:www-collation-com:1.0
              urn:www-collation-com:1.0/results.xsd">

    <WebServiceImpl array="1"
        guid="FFVVRJ5618KJRHNFUIRV845NRUVHR" xsi:type="coll:com.model.topology.app.web.WebService">
        <isPlaceholder>false</isPlaceholder>
        <displayName>server.servername3.siqom.siqom.us.com</displayName>
        <hierarchyType>WebService</hierarchyType>
        <hierarchyDomain>app.web</hierarchyDomain>
    </WebServiceImpl>
</results>


etc...

There are a few ways going forward.

Splitting the file

First, if your huge file is actually several concatenated XML files (as the sample you have shown), then this huge file is not a (valid) XML file, and I advise splitting it before handing it to a strict XML parsing library (StAX, DOM, SAX, XSL, whatever...).

A valid XML file only has one prolog and one root element.

You could use the XML prolog as a split marker, using pure IO / byte level APIs (no XML involved).

Each one of the splits can then be treated as a single XML "file" (independently, if need be, for multithreading purposes). I do not mean "file" literally; it could be a chunk of byte[] split from the original huge file.
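As a minimal sketch of that splitting idea (the `XmlSplitter` class and `split` method names are hypothetical, and a real implementation should stream bytes rather than load the whole file into memory):

```java
import java.util.ArrayList;
import java.util.List;

public class XmlSplitter {

    // Split a string containing several concatenated XML documents,
    // using the XML prolog as the split marker (pure text, no XML parsing).
    public static List<String> split(String concatenated) {
        List<String> docs = new ArrayList<>();
        String marker = "<?xml";
        int start = concatenated.indexOf(marker);
        while (start >= 0) {
            int next = concatenated.indexOf(marker, start + marker.length());
            String doc = (next >= 0
                    ? concatenated.substring(start, next)
                    : concatenated.substring(start)).trim();
            docs.add(doc);
            start = next;
        }
        return docs;
    }
}
```

Each returned chunk is then a well-formed, standalone XML document that any strict parser will accept.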

Speeding up XML Parsing

About your code

Using XMLEventReader, there are a few things in your sample code that stick out.

  1. You should not iterate over the attributes as you do. Unless I'm missing something, you are not doing anything with this iteration.
  2. Once you are at the START_ELEMENT whose localName is displayName , you should call getElementText , which, being internal to the parser, has a few optimization tricks for speed that your while loop cannot achieve. This call leaves the reader at the matching END_ELEMENT , so in effect you simplify your code quite a bit (only check for the displayName START_ELEMENT , that's all).
  3. Your XML seems well formed, so you can stop parsing as soon as you have found a result.
  4. XMLInputFactory instances are meant to be reused, so do not create one per file; create one shared instance.
  5. XML(xxx)Reader instances are closeable, so close them.
  6. Some XML libraries have faster character decoding schemes than the ones the JDK provides (knowing the internals of XML encodings allows them that), so if you have a valid XML prolog declaring the encoding at the beginning of the file, you should feed your factory a File object or an InputStream , not a Reader .
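Putting those points together, a minimal sketch could look like this (the `DisplayNameParser` class and `parse` method names are my own, not from your code):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class DisplayNameParser {

    // Point 4: one shared factory, created once.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Returns the first displayName text, or null if none is found.
    public static String parse(File file) throws Exception {
        // Point 6: hand the parser an InputStream so it decodes the
        // encoding declared in the XML prolog itself.
        try (InputStream in = new FileInputStream(file)) {
            XMLEventReader reader = FACTORY.createXMLEventReader(in);
            try {
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()
                            && "displayName".equals(
                                    event.asStartElement().getName().getLocalPart())) {
                        // Point 2: getElementText() reads up to the matching
                        // END_ELEMENT; point 3: return immediately.
                        return reader.getElementText();
                    }
                }
            } finally {
                reader.close(); // point 5
            }
        }
        return null;
    }
}
```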

Switching to XMLStreamReader

Other than that, you'd get faster performance out of XMLStreamReader than XMLEventReader . This is because XMLEvent instances are costly, thanks to their ability to stay usable even after the parser that created them has moved on. This means an XMLEvent is relatively heavyweight: it holds every possible bit of information relevant at the time of its creation (the namespace context, all attributes, ...), which has a cost to build and a cost to hold in memory.

Events may be cached and referenced after the parse has completed.

XMLStreamReader does not emit any events, so it does not pay this price. Seeing that you only need to read a text value and have no use for the XMLEvent after parsing, the stream reader will yield better performance.
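The cursor-style equivalent is a sketch along these lines (again, the class and method names are hypothetical):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingParser {

    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Cursor-style parsing: the reader is simply advanced through the
    // document, and no XMLEvent objects are ever allocated.
    public static String firstDisplayName(File file) throws Exception {
        try (InputStream in = new FileInputStream(file)) {
            XMLStreamReader reader = FACTORY.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "displayName".equals(reader.getLocalName())) {
                        return reader.getElementText(); // stop at first match
                    }
                }
            } finally {
                reader.close();
            }
        }
        return null;
    }
}
```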

Switching to a faster XMLStreamReader

Last time I checked (a bit too long ago), Woodstox was quite a bit faster than the JDK's standard StAX implementation (derived from Apache Xerces). But there might be faster options around.
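Switching is usually just a classpath change: Woodstox registers itself through the standard service-loader mechanism, so `XMLInputFactory.newInstance()` picks it up automatically once the jar is present. A Maven dependency sketch (the version shown is only an example; check for the current release):

```xml
<dependency>
    <groupId>com.fasterxml.woodstox</groupId>
    <artifactId>woodstox-core</artifactId>
    <version>6.5.1</version>
</dependency>
```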

Try a different XML technology?

I highly doubt you'd get faster performance out of any other parsing technology (SAX is usually equivalent, but you do not really have the option to quit parsing as soon as you have found your relevant tag). XSLT is pretty fast, but the power it offers comes with a performance price (usually some kind of lightweight DOM tree is built). The same goes for XPath: the expressiveness of the expressions usually implies some kind of complex structure being kept underneath. DOM is, of course, generally much slower.

What about not doing XML ?

Abandoning XML tooling should probably be a last resort, used only if every other bit of optimization has already been applied and you know for a fact that your XML processing is the bottleneck (not the IO, not anything else, just the XML processing in and of itself).

As @MichaelKay states in the comments, not using XML tools may break at any point in the future: the way the files are created, while remaining completely equivalent as XML, might evolve and break a simple text-based tool.

Using purely text-based tools, you might get fooled by a change in the namespace declarations, varying line breaks, HTML entity encoding, external references, and many other XML-specific subtleties, all for a fraction of extra performance.

Multi-threading your process

The use of multithreading could be a solution but it is not without caveats.

If your process runs in a typical EE server implementation, with advanced configurations and any kind of decent load, multithreading is not always a win, because the system may already be lacking resources to spawn additional threads, and/or you may be defeating internal optimizations of the server by creating threads outside of its managed facilities.

If your process is a so-called lightweight application, or if its typical usage entails only a few users using it simultaneously, it is less likely that you would run into such issues and you might consider spawning an ExecutorService to do the XML parsing in parallel.

Another thing to consider is the IO. CPU-wise, the XML processing of individual files should profit as much as possible from parallelising the parsing. But you might be bottlenecked by other parts of the process, usually the IO. If you can parse XML faster on a single CPU than you can pull data off the disk, then parallelisation is of no use: you'd get many threads waiting for the disk, which might starve your system for little (if any) gain. So you have to tune accordingly.
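Assuming the huge file has already been split into per-document chunks, a hedged sketch of such a parallel setup could look like this (the `ParallelParse` class is hypothetical, and `extractDisplayName` is a trivial text-based stand-in for the real StAX parse of one chunk):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelParse {

    // Parse many XML chunks in parallel. The pool is sized to the CPU count
    // because parsing is CPU-bound, but IO contention may make a smaller
    // pool just as fast: measure before settling on a size.
    public static List<String> parseAll(List<String> xmlChunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String xml : xmlChunks) {
                futures.add(pool.submit(() -> extractDisplayName(xml)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // preserves submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical per-chunk worker: in real code this would be the
    // StAX parse of one chunk, not a string search.
    static String extractDisplayName(String xml) {
        int s = xml.indexOf("<displayName>");
        int e = xml.indexOf("</displayName>");
        return (s >= 0 && e > s)
                ? xml.substring(s + "<displayName>".length(), e)
                : null;
    }
}
```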

Changing the process

If you're stuck at reading a "huge file" or thousands of small files in a single unit of work, it might be a good opportunity to step back and look at your process.

  1. Reading thousands of small files has a cost in terms of IO and system calls, which in effect are blocking calls . Your Java process has to wait for data coming out of the system-level machinery. If you have a way to minimise the number of system calls (open fewer files, use larger buffers...), this could be a win. I mean: reading a single tar file (containing 2000 small XML files of a few kB each) can usually be achieved faster than reading 2000 individual files.

  2. Doing the work pre-emptively / on the fly. Why wait until the user asks for the data to parse the XML? Would it not be possible to parse it as soon as the data arrives in the system (maybe asynchronously?)? That would save you the trouble of reading data from the disk, and might give you a chance to plug into a process that would have parsed the file anyway, saving time on both counts. Then you'd only have to query for the results (in a database of sorts) when the user request comes.

Going forward

You cannot improve performance without measuring.

So : measure.

How much does the IO cost ?

How much does the XML processing cost? And what part of it? (In your sample code, just the needless initialization of an XMLInputFactory per file means there is a LOT to be gained, if only you had measured it with a profiler.)

How much does the other stuff in your service call cost? (Do you connect to a DB before / after the call? For each file? Could that be done differently?)

If you are still stuck, you may edit your question with those findings, to get further help.

Since there are multiple XML files to parse, you can use multithreading to parse several files at a time and store the results in a thread-safe collection such as a CopyOnWriteArrayList or a ConcurrentHashMap. The StAX parser is already optimized and is intended for larger XML files. Besides, if you do not require all the data from the XML, you can use XPath, though XPath and streaming XML parsing are different approaches.

Where are the numbers? You can't tackle performance problems without measurements. What performance are you achieving? Is it chronically bad, or is it already close to the best you can reasonably expect?

There's only one performance "blunder" I can see in your code, and that's creating a new parser factory for each file (creating the factory is very expensive; it involves examining every JAR on the classpath). But then you confuse me: you say you are parsing one huge file (what does "huge" mean, actually?), but what you've shown seems to be a concatenation of many small XML documents. The two use cases are quite different from a performance point of view: with lots of small documents, initialising the parser is often a large part of the total cost.
