
XPath over multiple XML files from different URLs is very slow

I need to check only one node in each of 109 files, each stored at a different URL (109 URLs). I use this code:

public class XPathParserXML {

    public String version(String link, String serial) throws SAXException, IOException,
            ParserConfigurationException, XPathExpressionException {
        String url = link + serial;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(url);
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xPath.compile("//swVersion/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        if (nodes == null || nodes.getLength() == 0) {
            return "!!WORKING!!";
        }
        return nodes.item(0).getNodeValue();
    }
}

and I call the method version(link, serial) in a loop 109 times.

My code takes about 20 seconds to process them all. Each file weighs 0.64 KB and I have a 20 MB connection.

What can i do to speed up my code?

1. Object caching:

While that's probably not the only issue, you should definitely cache and reuse all of these objects between calls to version():

  • DocumentBuilderFactory
  • DocumentBuilder
  • XPathFactory
  • XPathExpression
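As a sketch, here is the class above reworked to hold those objects in fields. The versionFrom overload is my addition, so the parsing logic can be run on any stream rather than only a URL; everything else mirrors the original method:

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathParserXML {

    // Created once and reused for every call to version() — building
    // these per call is a large part of the per-file cost.
    private final DocumentBuilder builder;
    private final XPathExpression expr;

    public XPathParserXML() throws Exception {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        expr = XPathFactory.newInstance().newXPath().compile("//swVersion/text()");
    }

    public String version(String link, String serial) throws Exception {
        try (InputStream in = new java.net.URL(link + serial).openStream()) {
            return versionFrom(in);
        }
    }

    // Separated out so the same parsing logic also works on a local stream.
    public String versionFrom(InputStream in) throws Exception {
        Document doc = builder.parse(in);
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength() == 0 ? "!!WORKING!!" : nodes.item(0).getNodeValue();
    }
}
```

Note that DocumentBuilder and XPathExpression are reusable across sequential calls, but not thread-safe, which matters if you later parallelise.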

2. Circumvention of a known JAXP performance issue:

Besides, you should probably activate one of these flags:

-Dorg.apache.xml.dtm.DTMManager=
  org.apache.xml.dtm.ref.DTMManagerDefault

or

-Dcom.sun.org.apache.xml.internal.dtm.DTMManager=
  com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault

See also this question for details:

Java XPath (Apache JAXP implementation) performance

3. Reduce latency impact

Last but not least, you're serially accessing all those XML files over the wire. It may be useful to reduce the impact of your connection latency by parallelising access to those files, e.g. by using multiple threads on the client side. (If you choose multi-threading, beware of thread-safety issues when caching the objects mentioned in the first section. Also, avoid creating too many parallel requests at once, to keep from overloading the server.)
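A sketch of that idea with a fixed-size thread pool. The fetchOne parameter stands in for a call like version(link, serial) — it's a placeholder of mine, not part of your code — and the pool size caps the number of concurrent requests:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelVersions {

    // Fetch all serials concurrently. Each task should use its own
    // DocumentBuilder/XPathExpression internally, as those are not thread-safe.
    public static Map<String, String> fetchAll(List<String> serials,
                                               Function<String, String> fetchOne,
                                               int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String s : serials) {
                pending.put(s, pool.submit(() -> fetchOne.apply(s)));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get()); // blocks until done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 109 small files, a pool of around 8 threads collapses 109 sequential round-trips into roughly 14 batches, so latency rather than bandwidth stops dominating.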

Another way to reduce that impact would be to have the server expose those XML files in a single ZIP file, so that all 109 documents are transferred over one connection.
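On the client side, consuming such a ZIP could look like the following sketch (the method name and entry layout are assumptions; the server-side packaging is up to you):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class ZipVersions {

    // Parse every XML file contained in one ZIP stream, mapping each entry
    // name to its swVersion value — one connection instead of 109.
    public static Map<String, String> versionsFromZip(InputStream zipStream) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XPathExpression expr = XPathFactory.newInstance().newXPath().compile("//swVersion");
        Map<String, String> versions = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // readAllBytes() stops at the end of the current entry,
                // because ZipInputStream reports EOF per entry.
                Document doc = builder.parse(new ByteArrayInputStream(zin.readAllBytes()));
                versions.put(entry.getName(), expr.evaluate(doc));
            }
        }
        return versions;
    }
}
```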

4. Avoid XML validation if you can trust the source

From your additional comments, I see that you're using XML validation. This is, of course, expensive and should only be done if really needed. Since you run a very arbitrary XPath expression, I take it that you don't care too much about XML validation. Best turn it off!
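A sketch of a factory configured that way. The load-external-dtd feature URI is Xerces-specific, but the default JDK parser is derived from Xerces and accepts it:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class NonValidatingBuilder {

    public static DocumentBuilder create() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);      // off by default, but be explicit
        factory.setNamespaceAware(false);
        // Xerces feature: don't fetch external DTDs over the network, which
        // can cost an extra round-trip per document even without validation.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory.newDocumentBuilder();
    }
}
```

Skipping the external DTD fetch matters here: with 109 documents, a DTD reference in each one would silently double your number of HTTP requests.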

5. If all else fails... Avoid DOM

Since (from your comments) you've measured the parsing to take up most of the CPU, you have two more options to circumvent the whole issue:

  • Use a SAX parser and abort parsing once you reach the //swVersion element (from your code, I'm assuming there is only one). SAX is much faster than DOM for such use-cases.
  • Avoid XML entirely and search the document with a regex: <swVersion>(.*?)</swVersion> . That should only be your last resort, because it doesn't handle
    • namespaces
    • attributes
    • whitespace
