XOM canonicalization takes too long

Question

I have an XML file that can be as large as 1 GB. I am using XOM to avoid OutOfMemory exceptions.

I need to canonicalize the entire document, but the canonicalization takes a long time, even for a 1.5 MB file.

Here is what I have done:

I have this sample XML file and I increase the size of the document by replicating the Item node.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Packet id="some" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Head>
<PacketId>a34567890</PacketId>
<PacketHeadItem1>12345</PacketHeadItem1>
<PacketHeadItem2>1</PacketHeadItem2>
<PacketHeadItem3>18</PacketHeadItem3>
<PacketHeadItem4/>
<PacketHeadItem5>12082011111408</PacketHeadItem5>
<PacketHeadItem6>1</PacketHeadItem6>
</Head>
<List id="list">
    <Item>
        <Item1>item1</Item1>
        <Item2>item2</Item2>
        <Item3>item3</Item3>
        <Item4>item4</Item4>
        <Item5>item5</Item5>
        <Item6>item6</Item6>
        <Item7>item7</Item7>
    </Item>
</List>
</Packet>

The code I am using for canonicalization is as follows:

private static void canonXOM() throws Exception {
    String file = "D:\\PACKET.xml";
    FileInputStream xmlFile = new FileInputStream(file);

    Builder builder = new Builder(false);
    Document doc = builder.build(xmlFile);

    FileOutputStream fos = new FileOutputStream("D:\\canon.xml");
    Canonicalizer outputter = new Canonicalizer(fos);

    System.out.println("Query");
    Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");

    System.out.println("Canon");
    outputter.write(nodes);

    fos.close();
}

Although this code works well for small files, the canonicalization step takes about 7 minutes for a 1.5 MB file in my development environment (4 GB RAM, 64-bit, Eclipse, Windows).

Any pointers to the cause of this delay are highly appreciated.

PS: I need to canonicalize segments of an XML document as well as the whole document itself, so passing the document itself as the argument does not work for me.

Best

Answer 1

Memory is not the bottleneck.

The main thread is green (running) and is never blocked, so it is using as much CPU as it can.
Because my machine has multiple cores, the total CPU usage is not at 100%,
but the single core that the main thread runs on is fully loaded.

Nodes.contains is the busiest method.

Internally, Nodes keeps its nodes in a List and searches it linearly. The more items the List holds, the slower each contains call becomes.

private final List nodes;
public boolean contains(Node node) {
    return nodes.contains(node);
}

So:

Try modifying the library's code to hold the nodes in a hash-based collection such as a HashMap,
or use multiple threads to make use of more CPUs, if your XML can be split into smaller documents.

Tool: JVisualVM. http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/index.html
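
To make the cost concrete, here is a minimal, JDK-only sketch of the two lookup strategies. It does not touch XOM at all: plain Object instances stand in for nodes, and the class and method names (MembershipSketch, countHitsWithList, countHitsWithIdentitySet) are made up for the illustration. It only shows why asking a list "do you contain this node?" once per node gets slow quickly, and how an identity-based hash set avoids that.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class MembershipSketch {

    /** One List.contains call per candidate: each call scans the list, O(n) per lookup. */
    public static int countHitsWithList(List<Object> selected, List<Object> candidates) {
        int hits = 0;
        for (Object candidate : candidates) {
            if (selected.contains(candidate)) {
                hits++;
            }
        }
        return hits;
    }

    /** Same question answered through an identity-based hash set: O(1) expected per lookup. */
    public static int countHitsWithIdentitySet(List<Object> selected, List<Object> candidates) {
        Set<Object> index = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        index.addAll(selected);
        int hits = 0;
        for (Object candidate : candidates) {
            if (index.contains(candidate)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-ins for XML nodes; the count is arbitrary and only there to make the gap visible.
        List<Object> nodes = new ArrayList<Object>();
        for (int i = 0; i < 20000; i++) {
            nodes.add(new Object());
        }

        long t0 = System.nanoTime();
        int viaList = countHitsWithList(nodes, nodes);
        long t1 = System.nanoTime();
        int viaSet = countHitsWithIdentitySet(nodes, nodes);
        long t2 = System.nanoTime();

        System.out.println("list: " + viaList + " hits in " + (t1 - t0) / 1000000 + " ms");
        System.out.println("set:  " + viaSet + " hits in " + (t2 - t1) / 1000000 + " ms");
    }
}
```

With n nodes, the list version performs on the order of n squared comparisons in total, while the set version stays roughly linear. Whether a patched Nodes class can use exactly this structure depends on XOM internals that I have not verified here.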

Answer 2

Since you want the whole document serialized, can you just replace

Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
outputter.write(nodes);

with

outputter.write(doc);

?

It looks like Canonicalizer does extra work (such as the nodes.contains() calls mentioned by whunmr) when given a node list instead of just a root node to canonicalize.

If that doesn't work or is not enough, I would fork Canonicalizer and make optimizations there as suggested by profiling.

Answer 3

I may have a solution to your problem, if you're willing to give up on XOM. My solution consists of using the XPath API and Apache Santuario.

The difference in performance is impressive, but I thought it would be good to provide a comparison.

For the tests I used the XML file from your question, grown to 1.5 MB.
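
If you want to reproduce a file like that, the following is a rough sketch of a generator that replicates the Item node until the document is around 1.5 MB. The class name PacketGenerator, the default output file name and the shortened Head section are my own choices for the example, not part of the original test setup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PacketGenerator {

    /** Builds one Item block with children Item1..Item7, mirroring the sample document. */
    static String buildItem() {
        StringBuilder sb = new StringBuilder("    <Item>\n");
        for (int i = 1; i <= 7; i++) {
            sb.append("        <Item").append(i).append(">item").append(i)
              .append("</Item").append(i).append(">\n");
        }
        return sb.append("    </Item>\n").toString();
    }

    /** Builds a Packet document whose List holds itemCount copies of the Item block. */
    public static String buildPacket(int itemCount) {
        String item = buildItem();
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n");
        sb.append("<Packet id=\"some\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n");
        // Head is shortened here; the sample in the question has more child elements.
        sb.append("<Head>\n<PacketId>a34567890</PacketId>\n</Head>\n");
        sb.append("<List id=\"list\">\n");
        for (int i = 0; i < itemCount; i++) {
            sb.append(item);
        }
        sb.append("</List>\n</Packet>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pick the item count so the output lands near 1.5 MB.
        int itemBytes = buildItem().getBytes(StandardCharsets.UTF_8).length;
        int itemCount = (int) (1500000L / itemBytes);
        String out = args.length > 0 ? args[0] : "generated-input.xml";
        Files.write(Paths.get(out), buildPacket(itemCount).getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + itemCount + " items to " + out);
    }
}
```

This needs Java 7 or later because of java.nio.file; on Java 6 a FileOutputStream would do the same job.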

The XOM Test

FileInputStream xmlFile = new FileInputStream("input.xml");

Builder builder = new Builder(false);
Document doc = builder.build(xmlFile);

FileOutputStream fos = new FileOutputStream("output.xml");
nu.xom.canonical.Canonicalizer outputter = new nu.xom.canonical.Canonicalizer(fos);

Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
outputter.write(nodes);

fos.close();

The XPath/Santuario Test

org.apache.xml.security.Init.init();

DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
org.w3c.dom.Document doc = builder.parse("input.xml");

XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();

org.w3c.dom.NodeList result = (org.w3c.dom.NodeList) xpath.evaluate("./descendant-or-self::node()|./@*", doc, XPathConstants.NODESET);

Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
byte canonXmlBytes[] = canon.canonicalizeXPathNodeSet(result);

IOUtils.write(canonXmlBytes, new FileOutputStream(new File("output.xml")));

The Results

[Figure: chart of the results]

Below is a table with the results in seconds. Tests were performed 16 times.

╔═════════════════╦═════════╦═══════════╗
║      Test       ║ Average ║ Std. Dev. ║
╠═════════════════╬═════════╬═══════════╣
║ XOM             ║ 140.433 ║   4.851   ║
╠═════════════════╬═════════╬═══════════╣
║ XPath/Santuario ║ 2.4585  ║  0.11187  ║
╚═════════════════╩═════════╩═══════════╝
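
For anyone repeating the measurement, here is a small sketch of how an average and standard deviation over repeated runs can be computed. The class name TimingStats and the placeholder workload are hypothetical, and I am assuming the sample standard deviation (dividing by n - 1); substitute the canonicalization call you want to measure for the placeholder.

```java
public class TimingStats {

    /** Arithmetic mean of the measurements. */
    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two measurements. */
    public static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    /** Runs the task the given number of times and returns each wall-clock duration in seconds. */
    public static double[] timeSeconds(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the code under test.
        Runnable task = new Runnable() {
            public void run() {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 100000; i++) {
                    sb.append(i);
                }
            }
        };
        double[] runs = timeSeconds(task, 16);
        System.out.println("average: " + mean(runs) + " s, std. dev.: " + sampleStdDev(runs) + " s");
    }
}
```

Note that this does no JVM warm-up, so the first few runs may be slower than the rest.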

The difference in performance is huge, and it is related to the implementation of the XML Path Language (XPath). The downside of using XPath/Santuario is that their APIs are not as simple as XOM's.

Test Details

Machine: Intel Core i5, 4 GB RAM
OS: Debian 6.0 64-bit
Java: OpenJDK 1.6.0_18 64-bit
XOM: 1.2.8
Apache Santuario: 1.5.3
