I have an XML file that can be as big as 1GB. I am using XOM to avoid OutOfMemory Exceptions.
I need to canonicalize the entire document, but the canonicalization takes a long time, even for a 1.5 MB file.
Here is what I have done:
I use the sample XML file below and increase its size by replicating the Item node.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Packet id="some" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Head>
<PacketId>a34567890</PacketId>
<PacketHeadItem1>12345</PacketHeadItem1>
<PacketHeadItem2>1</PacketHeadItem2>
<PacketHeadItem3>18</PacketHeadItem3>
<PacketHeadItem4/>
<PacketHeadItem5>12082011111408</PacketHeadItem5>
<PacketHeadItem6>1</PacketHeadItem6>
</Head>
<List id="list">
<Item>
<Item1>item1</Item1>
<Item2>item2</Item2>
<Item3>item3</Item3>
<Item4>item4</Item4>
<Item5>item5</Item5>
<Item6>item6</Item6>
<Item7>item7</Item7>
</Item>
</List>
</Packet>
The code I am using for canonicalization is as follows:
private static void canonXOM() throws Exception {
    String file = "D:\\PACKET.xml";
    FileInputStream xmlFile = new FileInputStream(file);
    Builder builder = new Builder(false);
    Document doc = builder.build(xmlFile);
    FileOutputStream fos = new FileOutputStream("D:\\canon.xml");
    Canonicalizer outputter = new Canonicalizer(fos);
    System.out.println("Query");
    Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
    System.out.println("Canon");
    outputter.write(nodes);
    fos.close();
}
Although this code works well for small files, the canonicalization step takes about 7 minutes for a 1.5 MB file in my development environment (4 GB RAM, 64-bit Windows, Eclipse).
Any pointers to the cause of this delay are highly appreciated.
PS. I need to canonicalize segments from a whole XML document, as well as the whole document itself. So, using the document itself as the argument does not work for me.
Best
Memory is not the bottleneck.
In the profiler the main thread shows as running (green) and is never blocked; it uses as much CPU as it can.
Because my machine has multiple cores, total CPU usage is not at 100%,
but the single core that the main thread runs on is fully loaded.
Nodes.contains is the hottest method.
Internally, Nodes keeps its nodes in a List and compares them linearly, so the more items the List holds, the slower each contains call becomes:
    private final List nodes;
    public boolean contains(Node node) {
        return nodes.contains(node);
    }
Tool used: JVisualVM. http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/index.html
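To illustrate why a linear contains gets expensive as the list grows, here is a minimal sketch that counts equals() calls instead of measuring time. It uses only the JDK collections; the Item class is a hypothetical stand-in and is not XOM's Node, and the HashSet variant is shown only for comparison, not as something XOM does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Counts equals() calls made by membership tests on a List and on a Set. */
class ContainsCost {
    // Incremented by every Item.equals() call; reset before each measurement.
    static long equalsCalls = 0;

    /** Hypothetical stand-in for a node (assumption: not XOM's Node class). */
    static final class Item {
        final int id;

        Item(int id) { this.id = id; }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Item && ((Item) other).id == id;
        }

        @Override
        public int hashCode() { return id; }
    }

    static List<Item> makeItems(int n) {
        List<Item> items = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            items.add(new Item(i));
        }
        return items;
    }

    /** Looks up every item once in an ArrayList; returns the equals() calls made. */
    static long listLookups(int n) {
        List<Item> items = makeItems(n);
        equalsCalls = 0;
        for (Item item : items) {
            if (!items.contains(item)) throw new IllegalStateException("missing item");
        }
        return equalsCalls;
    }

    /** Looks up every item once in a HashSet; returns the equals() calls made. */
    static long setLookups(int n) {
        List<Item> items = makeItems(n);
        Set<Item> set = new HashSet<>(items);
        equalsCalls = 0;
        for (Item item : items) {
            if (!set.contains(item)) throw new IllegalStateException("missing item");
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 10_000;
        System.out.println("ArrayList equals() calls: " + listLookups(n));
        System.out.println("HashSet equals() calls:   " + setLookups(n));
    }
}
```

For n items the list version performs n(n+1)/2 equals() calls in total, because the item at index i is only found after i+1 comparisons. That quadratic growth is consistent with a canonicalizer that calls contains once per node becoming very slow on large node lists.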
Since you want the whole document serialized, can you just replace
Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
outputter.write(nodes);
with
outputter.write(doc);
?
It looks like Canonicalizer does extra work (such as the nodes.contains() calls mentioned by whunmr) when given a node list instead of just a root node to canonicalize.
If that doesn't work or is not enough, I would fork Canonicalizer and make optimizations there as suggested by profiling.
I may have a solution to your problem, if you're willing to give up on XOM. My solution consists of using the XPath API and Apache Santuario.
The difference in performance is impressive, but I thought it would be good to provide a comparison.
For the tests I used the XML file from your question, at a size of 1.5 MB.
XOM test:
FileInputStream xmlFile = new FileInputStream("input.xml");
Builder builder = new Builder(false);
Document doc = builder.build(xmlFile);
FileOutputStream fos = new FileOutputStream("output.xml");
nu.xom.canonical.Canonicalizer outputter = new nu.xom.canonical.Canonicalizer(fos);
Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
outputter.write(nodes);
fos.close();
XPath/Santuario test:
org.apache.xml.security.Init.init();
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
org.w3c.dom.Document doc = builder.parse("input.xml");
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
org.w3c.dom.NodeList result = (org.w3c.dom.NodeList) xpath.evaluate("./descendant-or-self::node()|./@*", doc, XPathConstants.NODESET);
Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
byte[] canonXmlBytes = canon.canonicalizeXPathNodeSet(result);
FileOutputStream out = new FileOutputStream("output.xml");
IOUtils.write(canonXmlBytes, out);
out.close();
The table below shows the results in seconds. Each test was run 16 times.
╔═════════════════╦═════════╦═══════════╗
║ Test ║ Average ║ Std. Dev. ║
╠═════════════════╬═════════╬═══════════╣
║ XOM ║ 140.433 ║ 4.851 ║
╠═════════════════╬═════════╬═══════════╣
║ XPath/Santuario ║ 2.4585 ║ 0.11187 ║
╚═════════════════╩═════════╩═══════════╝
The difference in performance is huge, and it is related to the implementation of the XML Path Language (XPath). The downside of XPath/Santuario is that it is not as simple to use as XOM.
Machine: Intel Core i5 4GB RAM
OS: Debian 6.0 64bit
Java: OpenJDK 1.6.0_18 64bit
XOM: 1.2.8
Apache Santuario: 1.5.3
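The answer does not show how the timings were collected, so the following is only a sketch of one way to produce an average and standard deviation over 16 runs. It assumes wall-clock timing with System.nanoTime and the sample standard deviation (dividing by n - 1); whether the table used the sample or the population formula is not stated. The class and method names are hypothetical.

```java
/** Minimal timing harness: mean and sample standard deviation over repeated runs. */
class TimingStats {
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two values. */
    static double sampleStdDev(double[] values) {
        double m = mean(values);
        double squaredDiffs = 0;
        for (double v : values) {
            squaredDiffs += (v - m) * (v - m);
        }
        return Math.sqrt(squaredDiffs / (values.length - 1));
    }

    /** Runs the task the given number of times; returns each duration in seconds. */
    static double[] timeSeconds(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): substitute one of the canonicalization tests.
        Runnable task = () -> Math.sqrt(12345.678);
        double[] seconds = timeSeconds(task, 16);
        System.out.printf("average=%.4f s, std. dev.=%.4f s%n",
                mean(seconds), sampleStdDev(seconds));
    }
}
```

A harness like this does not account for JVM warm-up, so the first runs may be slower than the rest. Discarding a few initial runs is a common precaution.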